
Pino

I (infrequently) blog about data science, business intelligence, big data, web technologies and free software.


This list is for those who want to become more comfortable working with machine learning libraries, numbers and statistics.

I’ve been making a list of MOOCs related to data science for a while, but it’s not always easy to find one that fits your professional interests. The resources below have another advantage: you can progress through them at your own pace.

In the end, this is just a compilation of links to posts, tutorials, video lectures and workshops that I’ve either shared on Twitter in the last couple of years or kept in my bookmarks.

Machine learning

I selected the resources that are most suitable for beginners, together with those covering the parts of machine learning that I like the most.

Statistics

Python

Python is my go-to language for most things these days, and I get asked very frequently what the best resources are to learn programming in this language from scratch. Here is my list:

  • Codecademy has a Python track that seems suitable both for programmers who want to switch to a new language and for beginners to programming.
  • If you prefer a book to learn programming, Dive into Python is the only one that I can recommend.
  • I didn’t want to list any MOOC here, but I’ll make an exception with this Python course.

Once you are familiar with Python, the following resources for machine learning and data analysis can take your skills to the next level:

R

I’ve been trying hard to like R. In fact, I’ve been trying for more than 5 years, and I simply prefer Python. In any case, I still frequently launch an R prompt to use some of the fantastic packages that R has.

Databases and SQL

“If you’re doing data science/analysis, learn SQL. People hate on it, but it’s important. World of tech is built on it.” Greg Reda
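As a small taste of why it matters, here is a minimal sketch using Python’s built-in sqlite3 module. The `orders` table and its rows are made up purely for illustration; the point is only that a single declarative query answers a typical analysis question (how much has each customer spent?).

```python
import sqlite3

# Hypothetical in-memory table of orders, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Order count and total spend per customer, largest spenders first.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for customer, n_orders, total in rows:
    print(customer, n_orders, total)

conn.close()
```

The same `GROUP BY` pattern carries over, with small dialect differences, to most relational databases and to SQL layers on top of big data systems.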

Spark, Hadoop and distributed computing

If you have to deal with large volumes of data, these are the resources that I can recommend:

  • Data Science training with Spark, a 5+ hour video from the Spark Summit.
  • The DataFrames API will be the cornerstone of the future of Apache Spark.
  • Distributed systems theory is neither easy nor well documented. This post covers many important concepts for distributed systems engineers.
  • Mining of Massive Datasets is a must-read book based on Stanford’s Mining Massive Datasets and Data Mining courses. And it’s free!
  • Spark SQL is probably the best SQL interface for data stored in HDFS. This paper explains some relevant concepts related to it, such as data frames and the integration with the Catalyst optimizer, and includes a performance comparison with Shark.
  • Introduction to Spark SQL and its rule-based optimizer by Michael Armbrust
  • Databricks’s reference apps are a comprehensive set of examples to learn Spark.
  • Presentation on Parquet, the easiest way to get columnar storage in Hadoop, at Hadoop Summit.
  • Apache Flink is more suitable than Spark for iterative and streaming processes. The best resources to learn about it are the official YouTube channel and the DataArtisans YouTube channel.
  • You can learn from this post what Apache Kafka is and how it can be used at large scale.

Applying data science to your organization

To end with, some examples of how data science and machine learning can be used to add value to your organization:

That’s all

This list is obviously biased toward my preferences and experience. Moreover, I realised that some interesting topics, such as data visualization and experiment design, are not properly covered. That’s why any suggestions in the comments of this post are more than welcome.