Long story short, I am a Data Science consultant with 20+ years of experience in data analytics, software engineering and project management.
Data science is the discipline of extracting actionable information from data. It’s a wide domain ranging from classic data analysis and statistical inference to predictive analytics and unsupervised learning. The methods I use range from simple linear regression to Bayesian inference and neural networks.
The success of a data science project depends on 3 elements: questions, data and communication.
- Business questions must be translated into data projects with specific goals and metrics.
- Data has to be accessible and clean but, most of all, relevant!
- Communication drives project visibility. Stakeholders must have a clear understanding of the main takeaways at each step of the project.
I am available for consulting on all aspects of data science including:
- Predictive analytics and statistical inference: building bespoke models for specific applications
- Forecasting: going beyond classic linear modeling for non-stationary time series.
- Natural Language Processing (NLP): applying the latest innovations in NLP to extract information from large corpora, analyze social networks and automate text generation.
- Data science coaching: helping data teams set up their experiments, understand their results and implement stable data analytics workflows.
- Machine learning in the cloud: leveraging the fast-evolving landscape of artificial intelligence in the cloud with services from AWS, Google Cloud and Azure.
I am always happy to talk about your projects and ideas.
So do get in touch for a conversation.
Intro to NLP, new course on Openclassrooms
A new Intro to NLP course I am super excited about the launch of my new Intro to NLP course on Openclassrooms. This could not have happened without OC’s amazing team, with a special shoutout to Alexandra. :) The course covers basic BOW to static embeddings, GloVe style, with NLTK,...
Best practices when sharing your data analysis - Jupyter Notebooks
Tips to make your data analysis easier to share Context: You work on a large dataset, let’s say over 1 GB. You do an analysis. And you want to share the data and the script / Jupyter notebook so that other people can work on / reproduce / tweak your results. Here are...
New statistical learning course on Openclassrooms
I am very excited to announce that my new course on statistical learning is now available on Openclassrooms. Design Effective Statistical Models to Understand Your Data In this course I explore linear, logistic and polynomial regression with hands-on exercises, real-world use cases and non-trivial datasets. Regression is the mother...
Teaching Data Science at UM6P
Teaching Data Science is demanding, often intense, sometimes exhausting, but always an enriching and extremely rewarding experience. There are magical moments of teacher-student resonance when you can feel the knowledge flowing across the room. I had the chance to teach a 2-week session of predictive analytics in Morocco in the...
Reduce GPU costs with startup scripts on the Google Cloud Engine
Reduce GPU costs with on-demand instances and startup scripts This post is about leveraging the on-demand capabilities of costly virtual instances on the Google Cloud Engine using startup scripts. Deep Learning is expensive Here’s the situation: You’re working on some large dataset, and you feel the irresistible urge to...
iPhone addiction? Get a grip!
Is it a DNA thing? My wife has a super power! She is totally immune to the constant nagging of her iPhone. She has an amazing ability to resist checking her emails every 5 minutes, texting back on the spot and playing the whack-a-notification game all day long. Maybe it’s...
Top gsutil command lines to get started on Google Cloud Storage
Google Cloud Storage is a file storage service available from Google Cloud. Quite similar to Amazon S3, it offers interesting functionalities such as signed URLs, bucket synchronization, collaboration bucket settings and parallel uploads, and is S3 compatible. Gsutil, the associated command line tool, is part of the gcloud command line interface. After a...
AutoML on AWS
Build a predictive analytics pipeline in a flash When Bayesian optimization meets the Stochastic Gradient Descent algorithm on the AWS marketplace, rich features bloom, models are trained, Time-To-Market shrinks and stakeholders are satisfied. In this article, we present an AWS-based framework which allows non-technical people to build predictive...
gsutil is the Google Storage CLI tool. Equivalent to aws s3 but for the Google Cloud Platform, it allows you to access Google Cloud Storage from the command line. Beyond moving files and managing buckets, gsutil is a powerful file management (rsync) and file publication tool (signed URLs). Please find below...
AWS Machine Learning Big Data NYC
My slides on the AWS Machine Learning platform at the Global Big Data conference - NYC 2017 Oct 24. tl;dr: The AWS Machine Learning service is a simple but very efficient predictive analytics service for supervised classification and regression. The AWS ML service greatly simplifies the model selection and model optimization...
Workshop on Topic Modeling
I recently had the pleasure of leading a workshop on topic modeling as part of the Master Méthode computationnelle et analyse de contenu at Université Paris Est Marne la vallée. There are rather few resources in French on topic modeling. The only result I was able to...
Writing Effective Amazon Machine Learning
My article on the Amazon Machine Learning service, first published on the ODSC blog and then republished on KDnuggets, triggered a book project. Shortly after writing that article, I was contacted by Packt Publishing to write an entire book on the AWS Machine Learning service. Packt Publishing is well known for...
Large Data with Scikit-learn - Boston Meetup
### Large Data with Scikit-learn
* Alexis Perrier - [@alexip](https://twitter.com/alexip)
* Data & Software - [@BerkleeOnline](https://twitter.com/berkleeonline) - Day
* Data Science contributor - [@ODSC](https://twitter.com/odsc) - Night
### Plan
1) What is large data? out-of-core, streaming, online, batch?
2) Algorithms
3) Implementation
4) Examples
### Many great alternatives
* Dato: [GraphLab...
Paris Meetup slides Topic Modeling of Twitter Followers
### Topic Modeling
##### applied to Twitter feeds
* Alexis Perrier [@alexip](https://twitter.com/alexip)
* Data & Software, Berklee College of Music, Boston [@BerkleeOnline](https://twitter.com/berkleeonline)
* Data Science contributor [@ODSC](https://twitter.com/odsc)

**Part I: Topic Modeling**
* Nature and applications
* Algorithms and libraries

**Part II: Project: followers on Twitter**
* Methods
* Problems
* Viz...
Hands-on analysis of the Amazon Machine Learning service
Is the new Amazon Machine Learning too simple to reap the benefits of predictive analytics? Machine Learning as a Service (MLaaS) promises to put data science within the reach of companies. In that context, Amazon Machine Learning is a predictive analytics service with binary/multiclass classification and linear regression features. The...
Jupyter, Zeppelin, Beaker: The Rise of the Notebooks
One of the particularities of scientific computing is the need for experiments, explorations, and collaborations. This need is addressed by notebooks. Notebooks are collaborative web-based environments for data exploration and visualization — the perfect toolbox for data science. They help create reproducible, shareable, collaborative computational narratives. There are alternatives to...
Dynamics of Debates with Time Maps
2015 presidential debates The race for the presidential nomination for both parties is going full speed with a plethora of debates. At the time of writing there have been 4 Republican debates and 2 Democratic ones. These debates have a high impact on the presidential nomination race with candidates dropping out and...
NLP Analysis of the 2015 presidential candidate debates
I’ve been fascinated by the recent presidential nomination debates. Their format, the number of participants and the post-debate media frenzy all make for a good show. In the following 2 articles I’ve applied several powerful Text Mining and Natural Language Processing techniques to the transcripts. In this first article: Dissecting...
Scikit-learn's Out-of-Core Classifiers for Large Data
Here’s the scenario: A new Kaggle competition, a new dataset. Gigabytes? Ouch! Cold shivers as you anticipate hours of waiting to extract features and train models, and middle-of-the-night cold feet as you check that your script is still running. A data set is said to be large when...
Segmentation of Twitter Timelines via Topic Modeling
Following up on our first post on the subject, Topic Modeling of Twitter Followers, we compare different unsupervised methods to further analyze the timelines of the followers of the @alexip account. We compare the results obtained through Latent Semantic Analysis and Latent Dirichlet Allocation and we segment Twitter timelines based...
Topic Modeling of Twitter Followers
In this post, we explore LDA, an unsupervised topic modeling method, in the context of Twitter timelines. Given a Twitter account, is it possible to find out what subjects its followers are tweeting about? Knowing the evolution or the segmentation of an account’s followers can give actionable insights to a...
Feature Importance in Random Forests
Comparing Gini and Accuracy metrics We’re following up on Part I where we explored the DrivenData blood donation data set. The objective of the present article is to explore feature engineering and assess the impact of newly created features on the predictive power of the model in the context...
Blood Donation on DrivenData: Exploration
Blood Donation on DrivenData - Part I Exploration DrivenData.org is a machine learning competition website similar to the better-known Kaggle.com, but with a different angle. It focuses on leveraging Data Science for social issues. And it’s based in Boston! For the learning Data Scientist, DrivenData offers a good...