75+ free online resources to boost your data science and analysis skills

This list is for those who want to become more comfortable working with machine learning libraries, numbers and statistics.

I’ve been making a list of MOOCs related to data science for a while. But it’s not always easy to find one that fits your professional interests. Not to mention that with the following resources, you can progress at your own pace.

In the end, this is just a compilation of links to posts, tutorials, video lectures and workshops that either I’ve shared on Twitter in the last couple of years or that I have in my bookmarks.

Machine learning

I selected resources that are suitable for beginners, together with some covering the parts of machine learning that I like the most.

You can start with this introduction to data mining by Saed Sayad (University of Toronto). I found the first diagram particularly interesting.
This glossary of machine learning terms is the best that I’ve found so far.
An introduction to machine learning in 10 pictures is a short but great article to start with.
Xavier Amatriain, one of the minds behind Netflix’s famous recommendation system, explains the advantages of different classification algorithms.
Don’t miss this list of machine learning podcasts.
Introduction to Recommender Systems is a 4-hour lecture from the 2014 Machine Learning Summer School at CMU. You can find other interesting machine learning lectures from the same summer school and other programs on Alex Smola’s YouTube channel.
The Elements of Statistical Learning is a classic book, ideal for understanding the foundations of many machine learning methods.
This PyData talk is a good introduction to deep neural networks in general and to convolutional networks in particular.
This paper explains why you should use AUC to evaluate the performance of machine learning algorithms.
Text Mining with WEKA cookbook, for those who prefer Java.
“Machine Learning Gremlins” is a presentation on common machine learning mistakes by Ben Hamner (Kaggle).
Because we don’t always need exact answers, this introduction to stream mining by Mikio Braun can be very useful to you.
If you want a wider view of artificial intelligence, watch these lectures from the AI course taught at MIT by Patrick Winston.
The lectures of the course “CS273a: Introduction to Machine Learning” by Prof. Alex Ihler (UCI) are available on YouTube.
Choosing a machine learning model can be a cumbersome task. That’s why we have automatic machine learning to assist model selection. These slides are a good entry point to it.
For a good picture of the state of the art of neural networks and deep learning, you can find tutorials and workshops of the NIPS 2014 conference in this YouTube channel. You can also find this summary of the conference by John Platt (Microsoft Research).
You can find a good explanation of SGD and ALS for matrix factorization in this Quora thread.
Applied Data Mining and Statistical Learning (Pennsylvania State University).
Stanford CS231n: Convolutional Neural Networks for Visual Recognition: excellent slides and videos to learn about CNN.
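As a quick reminder of what the AUC mentioned above actually measures, here is a minimal sketch in plain Python. It is not taken from the paper: it just uses the standard interpretation of AUC as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The function name and the toy labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive/negative pair of examples (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = positive class, 0 = negative class.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auc(labels, scores))
```

The pairwise loop is quadratic, so it is only meant for small examples. For real datasets, a library implementation is the better choice.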

Statistics

Pretty handy resource to explain statistical significance: How to Assess Statistical Significance.
Top 10 big ideas covered in the Probability course at Harvard by Joe Blitzstein. You can also watch the lectures of this course on YouTube.
Learn more about errors in hypothesis testing (statistical significance and power) from this lecture on Data Collection and Statistical Inference by Aaron Gullickson.
What to do when data is missing? Learn what statisticians working in the clinical trial field do.
Introduction to Time Series Analysis from the book Engineering Statistics.
This article talks about how to optimize decisions beyond A/B testing, including an introduction to the multi-armed bandit problem and the epsilon-greedy strategy.
Jeff Rajeck has a series of posts titled “using data science with A/B tests”. I particularly enjoyed the one covering Bayesian analysis.
A game that simulates A/B tests and challenges you to make the right decisions: So You Think You Can Test?.
Brian Caffo is one of the lecturers of the Data Science specialization on Coursera and his YouTube channel is full of resources to learn statistics.
Some statistical concepts that data scientists usually overlook by Chris Fonnesbeck at SciPy 2015.
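To make the epsilon-greedy strategy from the multi-armed bandit article above more concrete, here is a minimal simulation in plain Python. It is a sketch under my own assumptions, not code from the article: the conversion rates, the value of epsilon and the function name are all hypothetical, and each variant is modelled as a simple yes/no (Bernoulli) conversion.

```python
import random


def epsilon_greedy(true_rates, epsilon=0.1, rounds=10000, seed=0):
    """Simulate an epsilon-greedy bandit over variants with yes/no rewards.

    true_rates: hypothetical conversion rate of each variant. The algorithm
    never reads them directly; they are only used to simulate visitors.
    Returns (number of pulls per variant, estimated rate per variant).
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: show a variant chosen at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: show the variant with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean for this variant.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


pulls, estimates = epsilon_greedy([0.05, 0.10, 0.20])
print(pulls)
print(estimates)
```

With enough rounds, most of the traffic typically ends up on the variant with the highest rate, while the fraction epsilon keeps being spent on exploration. That is the main difference from a classic A/B test, where traffic stays evenly split until the test ends.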

Python

Python is my go-to language for most things these days, and I get asked very frequently what the best resources are for learning to program in this language from scratch. Here is my list:

Codecademy has a Python track that seems suitable both for programmers who want to switch to a new language and for beginners to programming.
If you prefer a book to learn programming, Dive into Python is the only one that I can recommend.
I didn’t want to list any MOOC here, but I’ll make an exception with this Python course.

Once you are familiar with Python, the following resources for machine learning and data analysis can take your skills to the next level:

Video tutorials to learn how to use Python’s scikit-learn library to perform machine learning by Kevin Markham.
3h+ in-depth introduction to machine learning with scikit-learn by Kyle Kastner (Université de Montréal) and Andreas Mueller (NYU Center for Data Science).
Machine learning cheat sheet for scikit-learn by Andreas Mueller.
If you are interested in using neural networks in Python, Daniel Nouri explains how to solve the Facial Keypoint Detection Kaggle challenge using Lasagne.
If you don’t have a technical background, you’ll find the scripts in Practical Business Python very useful.
Notebook Gallery: links to the best IPython and Jupyter notebooks submitted by users.
Recipes of the IPython Cookbook include excellent examples of how to use NumPy, scikit-learn and many other packages.
Code snippets of some of the most common operations with Pandas.
Make your first machine learning predictions using Python with this Kaggle tutorial.
NLTK is the most popular library for natural language processing in Python. This presentation can give you a good overview of what you can do with it, and this 1-hour tutorial will show you how to get started.
PyDataTV is the YouTube channel of the PyData conferences. You can find keynotes, talks and workshops on how to use the PyData stack.
mpld3: interactive Matplotlib graphics in the browser and in IPython notebooks
EarthPy is a collection of IPython notebooks with examples of Earth Science data processing.

R

I’ve been trying hard to like R. In fact, I’ve been trying for more than 5 years, and I simply prefer Python. In any case, I still frequently launch an R prompt to use some of the fantastic packages that R has.

Intro to R is a playlist by Google Developers that explains all the basics of the language.
Kaggle top ranker Xavier Conort listed 10 R Packages to win Kaggle competitions. That’s a good way to discover some very prominent R packages.
An Introduction to Statistical Learning with Applications in R is a terrific free book full of examples.
“R: the good parts” is an article by Jose Quesada (Data Science Retreat) that lists the main advantages of R with links to other good resources.
Archetypal analysis is not usually taught in introductory machine learning courses. This post explains how to apply it and shows that it outperforms k-means in a number of cases. Plus, archetypal analysis is easier to interpret.
AnomalyDetection and BreakoutDetection: open source R packages for time-series analysis by Twitter.
qdap is not only one of the best packages for natural language processing in R, but also one of the best documented. Use the vignette to get started with it and later on the manual.
I know from my own experience that R’s memory limitations can give you a headache. These tricks are sometimes an effective painkiller. Also the slides “Taking R to the Limit: Large Datasets” might help.
statsTeachR is a repository of lessons for teaching statistics using R.
Make your first machine learning predictions using R with any of these four tutorials.

Databases and SQL

“If you’re doing data science/analysis, learn SQL. People hate on it, but it’s important. World of tech is built on it.” Greg Reda

If you want to know how a DBMS works, check the videos of the Stanford Introduction to Databases course by Jennifer Widom.
Postgres has a feature-rich implementation of the SQL language and it took me about 1 year to master most of its features. You can find a very good overview of many of these advanced features of the language in this video: Postgres: The Bits You Haven’t Found.
Apache Calcite is a query planner and optimizer for any kind of data source. If you want to get familiar with what a 21st century query planner and optimizer looks like, this is a good starting point.
Crosstabulation and pivot tables in Postgres are less arduous with the tablefunc module.
If you are a beginner to SQL and MySQL is your choice, try this tutorial that claims to cover 95-98% of everything you’ll ever need to know in MySQL or this playlist.

Spark, Hadoop and distributed computing

If you have to deal with large volumes of data, these are the resources that I can recommend:

Data Science training with Spark, a 5+ hour video from the Spark Summit.
The DataFrames API will be the cornerstone of the future of Apache Spark.
Distributed systems theory is neither easy nor well documented. This post covers many important concepts for distributed systems engineers.
Mining of Massive Datasets is a must-read book based on Stanford’s Mining Massive Datasets and Data Mining courses. And it’s free!
Spark SQL is probably the best SQL interface for data stored in HDFS. This paper explains some relevant concepts related to it, such as data frames and the integration with the Catalyst optimizer, and includes a performance comparison with Shark.
Introduction to Spark SQL and its rule-based optimizer by Michael Armbrust
Databricks’s reference apps is a comprehensive set of examples to learn Spark.
Presentation on Parquet, the easiest way to get columnar storage in Hadoop, at Hadoop Summit.
Apache Flink is more suitable than Spark for iterative and streaming processes. The best resources to learn about it are the official YouTube channel and the DataArtisans YouTube channel.
You can learn from this post what Apache Kafka is and how it can be used at large scale.

Applying data science to your organization

To end with, some examples of how data science and machine learning can be used to add value to your organization:

Jeff Leek (Johns Hopkins University) shared some interesting learnings in his post 10 things statistics taught us about big data analysis.
What can data science do for entrepreneurs? Growth, retention, product customization and marketing optimization.
How to start data science initiatives in a lean and cost-effective way.
This paper explains how Booking.com uses crowdsourced data and machine learning to suggest the destination of the next trip.
Recommendations done wrong. TripAdvisor launched a new recommendation feature and you can see in the comments of this post how much negative feedback they received.
Someone asked how Quora uses machine learning, and the answers are very representative of how a website can benefit from using it.
Airbnb guest requests are 4% more likely to be accepted after they used collaborative filtering to predict hosts’ behavior. They also used data to understand what their users want and show more relevant results to them.
This paper describes how to do customer segmentation for customer retention using decision trees.
How Spotify uses deep learning to recommend music is well documented in this post.
How Google transcribed house numbers from Street View using neural networks.
Predicting consumer credit-risk performance at the beginning of the economic recession using machine learning (Paper).

That’s all

This list is obviously biased toward my preferences and experience. Moreover, I realised that some interesting topics, such as data visualization and experiment design, are not properly covered. That’s why any suggestion in the comments of this post is more than welcome.

Pino