Speech recognition and asthma healthcare

Asthma is a common lung condition. Inhaled corticosteroids are a safe and effective treatment, but many people don’t take their medication as prescribed. We used AWS speech-to-text and machine learning to help improve the quality of care for people with asthma.

Speech recognition can help evaluate shared decision making and predict medication adherence in primary care setting


Mapping anti-Semitism on YouTube

Commission nationale consultative des droits de l'homme (CNCDH)

Mapping the anti-Semitic footprint on YouTube

As part of the 30th annual report of the Commission nationale consultative des droits de l’homme (CNCDH) on racism, anti-Semitism, and xenophobia, the médialab Sciences Po has undertaken a collective project to map anti-Semitic discourses on YouTube. The CNCDH report, published on July 8, 2021, highlights a relatively small but significant presence of anti-Semitism, despite the platform’s prior efforts to address it.


YouTube recommender system

The Effects of Disaggregated Factors of YouTube Recommendations in Diversity. International Conference on Computational Social Science (IC2S2), Jul 2020.

Keywords: YouTube, algorithmic recommendations, diversity of recommendations, agent-based models, consequences of recommender systems.

The aim of this study was to examine how different relationships between YouTube channels can impact the diversity of content consumed by users. The researchers used agent-based simulations to investigate the effects of prioritizing relationships in the recommendation system. The study focused on a wide collection of over 1400 YouTube channels from France. The dataset encompassed various types of channels, including those affiliated with political parties, public figures, news organizations, independent media, companies, and social movements.
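To give a feel for what an agent-based simulation of this kind can look like, here is a toy sketch. It is not the model used in the study: the channel set, the `same_category_weight` parameter, and the use of Shannon entropy as the diversity measure are all assumptions made for illustration. Each simulated viewer hops from channel to channel, and the recommender can be told to favour channels that share a relationship (here, simply the same category) with the current one.

```python
import math
import random
from collections import Counter

# Category labels borrowed from the channel types listed above;
# the channels themselves are hypothetical placeholders.
CATEGORIES = [
    "political party", "public figure", "news organization",
    "independent media", "company", "social movement",
]


def make_toy_channels(per_category=10):
    """Build a toy mapping of channel id -> category."""
    return {f"{cat}-{i}": cat for cat in CATEGORIES for i in range(per_category)}


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a Counter of category visits."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n > 0)


def simulate(channels, same_category_weight, n_agents=200, n_steps=30, seed=0):
    """Return the average per-agent entropy of the categories visited.

    same_category_weight is an assumed knob: 1.0 means the recommender
    ignores the relationship, larger values prioritize it.
    """
    rng = random.Random(seed)
    ids = list(channels)
    entropies = []
    for _ in range(n_agents):
        current = rng.choice(ids)
        visited = Counter([channels[current]])
        for _ in range(n_steps):
            candidates = [c for c in ids if c != current]
            weights = [
                same_category_weight if channels[c] == channels[current] else 1.0
                for c in candidates
            ]
            # The agent follows one recommendation drawn with these weights.
            current = rng.choices(candidates, weights=weights, k=1)[0]
            visited[channels[current]] += 1
        entropies.append(shannon_entropy(visited))
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    toy = make_toy_channels()
    print("neutral recommender:", simulate(toy, same_category_weight=1.0))
    print("relationship-first recommender:", simulate(toy, same_category_weight=20.0))
```

In this toy setting, comparing the two runs shows how prioritizing a single relationship narrows the range of categories a viewer encounters.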

Read the article here


Data science at InfoQ

In 2018, I joined the editorial team at InfoQ to write on what’s happening in data science. Recent news items include:

Machine Learning on Google Cloud - Packt Publishing - April 2018

I co-authored the book Hands-On Machine Learning on Google Cloud Platform.

Tutorials on DataCamp

A couple of tutorials on the DataCamp blog:


Effective Amazon Machine Learning - Packt Publishing - April 2017

This book focuses on the recent Amazon Machine Learning service. The service is intentionally simple to use and the book follows that philosophy. It is composed of three parts:

  • A thorough introduction to data science for the new data scientist. In the first two chapters, I present the minimal and necessary concepts in predictive modeling: classification, regression, metrics, bias and variance, as well as feature engineering tactics.

  • Starting with a brief overview of the AWS Machine Learning service in Chapter 3, I explore the service in depth in Chapters 4 to 6: loading the data using S3, building a model, and assessing the predictive power of the model. Throughout the book I use classic datasets such as the Titanic dataset, which is particularly well suited to creative feature engineering.

  • In the third part, we gear up and start using the Python SDK or the AWS CLI (command line interface) to upload data, build models, and assess them. Moving to scripting and the command line allows us to implement cross-validation and recursive feature selection. I close the book by showing you how to create a full data pipeline built around AWS Lambda, Redshift, and AWS Machine Learning, with Twitter as the source for a fun sentiment analysis project.
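The cross-validation step mentioned in the third part can be sketched independently of any AWS call. The snippet below is a minimal illustration in plain Python, not code from the book: `k_fold_indices`, `cross_validate`, and `majority_baseline` are hypothetical helper names, and in a real script the `fit_and_score` callback would be the place where each fold's data is uploaded, a model is trained, and an evaluation is retrieved.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; each row is in exactly one validation fold."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def cross_validate(rows, fit_and_score, k=5, seed=42):
    """Average the score returned by fit_and_score(train_rows, valid_rows) over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(rows), k, seed):
        train_rows = [rows[i] for i in train_idx]
        valid_rows = [rows[i] for i in valid_idx]
        scores.append(fit_and_score(train_rows, valid_rows))
    return sum(scores) / len(scores)


def majority_baseline(train_rows, valid_rows):
    """Hypothetical stand-in model: predict the most frequent training label.

    Rows are (features, label) tuples; the return value is accuracy on valid_rows.
    """
    labels = [label for _, label in train_rows]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in valid_rows if label == majority) / len(valid_rows)
```

Keeping the fold logic separate from the model call means the same splits can be reused to compare several feature sets, which is also the basis for a recursive feature selection loop.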

My goal in writing this book has been to go beyond the AWS Machine Learning service and offer the reader other efficient tools and methods that can be useful in a day-to-day data science / ETL workflow, for instance by showing how to leverage SQL queries and simple Bash shell scripting to perform feature extraction and feature engineering upstream.

The book is available on Amazon, on the Packtpub website and at Safari books.

The GitHub repo associated with the book contains all the code and datasets used in the book, organised by chapter.


Articles on ODSC and KDnuggets

I had the pleasure of working with the amazing team at ODSC in 2016 and taking part in organizing the ODSC conferences. I also wrote a few articles for them that enjoyed quite a bit of traffic.

  • Riding on Large Data with Scikit-learn: When your data consumes all the memory of your laptop but does not qualify as Big Data, the trick is to use the out-of-core mode of scikit-learn’s algorithms and stream your data to the model chunk by chunk. This batch processing introduces extra complexity and parameters that you need to be aware of.

  • Dissecting the Presidential Debates with an NLP Scalpel: Back in 2015, the presidential debates were going full speed with 13 Republican candidates and 5 Democratic candidates. In this post I explore several NLP techniques to decipher the debates in terms of topics, sentiment, candidate dynamics, and summarization.

  • Jupyter, Zeppelin, Beaker: The Rise of the Notebooks: With the rise of IPython notebooks, soon to become Jupyter notebooks, it was interesting to uncover other notebook projects such as the Beaker notebook, the Apache Zeppelin notebook and, of course, the venerable Sage notebooks.

  • Open Source and Data Science, a perfect match

  • On KDnuggets: Amazon Machine Learning: Nice and Easy or Overly Simple? AWS had just launched its new Machine Learning service and I could not resist trying to find out how it worked and performed. This post was the precursor of my book on the subject.
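As an illustration of the out-of-core mode described in the scikit-learn article above, here is a minimal sketch built on `SGDClassifier.partial_fit`. It is not the code from the article: the synthetic `stream_chunks` generator is a hypothetical stand-in for reading a large file piece by piece, and the chunk size and number of chunks are arbitrary choices, which are exactly the kind of extra parameters the article refers to.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def stream_chunks(n_chunks=20, chunk_size=500, n_features=10, seed=0):
    """Yield (X, y) chunks of synthetic data, one at a time.

    In practice this would read successive slices of a file that does
    not fit in memory; only one chunk is held at any moment.
    """
    rng = np.random.default_rng(seed)
    true_weights = rng.normal(size=n_features)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X @ true_weights > 0).astype(int)
        yield X, y


def train_out_of_core(chunks, classes=(0, 1)):
    """Fit a linear classifier incrementally, one chunk per partial_fit call."""
    model = SGDClassifier(random_state=0)
    for X, y in chunks:
        # partial_fit needs the full list of classes, since a single
        # chunk may not contain every label.
        model.partial_fit(X, y, classes=np.array(classes))
    return model


if __name__ == "__main__":
    chunks = stream_chunks()
    X_test, y_test = next(chunks)        # hold out the first chunk
    model = train_out_of_core(chunks)    # stream the remaining ones
    print("held-out accuracy:", model.score(X_test, y_test))
```

Only estimators that expose `partial_fit` can be trained this way, so the choice of algorithm is constrained along with the chunking parameters.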

Behind the Scenes with MOOCs: Berklee College of Music’s Experience Developing, Running, and Evaluating Courses through Coursera

The Journal of Continuing Higher Education 77:136 · January 2013

Back in 2013, MOOCs were still a novelty. We analyzed the behavior of our online students, with a focus on engagement, completion, and final scores.

You can download the article.