Pino

I (infrequently) blog about data science, business intelligence, big data, web technologies and free software.

Last weekend I shared the following slide from Joel Grus (author of the book Data Science from Scratch):

Joel was giving his own version of a famous quote by Robert A. Heinlein, highlighting with humour the wide range of tasks that data scientists perform. Since there is some truth behind every joke, I decided to compile a list of the tasks that data scientists regularly perform.

Statistics and algebra

  • Test a hypothesis (Joel Grus)
  • Update a prior (Joel Grus)
  • Measure the statistical dependence between two variables
  • Factor matrices (Joel Grus)
  • Explain and interpret p-values
  • Apply regression analysis
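
As a rough illustration of measuring the dependence between two variables, the sketch below computes the Pearson correlation coefficient using only the Python standard library. Note that Pearson correlation captures linear dependence only, so it is one possible measure rather than the definitive one. The function name `pearson` is hypothetical and chosen for this example.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each observation from its mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Numerator: sum of products of deviations (unnormalized covariance).
    cov = sum(a * b for a, b in zip(dx, dy))
    # Denominator: product of the two deviation norms.
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / norm
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 only rules out a linear one; the variables may still be dependent in a non-linear way.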

Machine learning

  • Understand the differences between the various classification and clustering methods
  • Implement a machine learning method vaguely described in a paper
  • Run a regression (Joel Grus)
  • Pretend to understand deep learning (Joel Grus)
  • Evaluate the performance of machine learning algorithms and use different error metrics
  • Design an experiment (Joel Grus)
  • Build a recommendation system
  • Use stream mining algorithms to extract knowledge from data that you can’t afford to process more than once
  • Engineer features from any existing dataset
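
To make "use different error metrics" a little more concrete, here is a minimal sketch that computes accuracy, precision and recall for a binary classifier from two lists of labels. The function name `classification_metrics` and the `positive` parameter are assumptions made for this example; in practice a library such as scikit-learn would normally be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells that the three metrics need.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Precision and recall fall back to 0.0 when their denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Looking at more than one metric matters because accuracy alone can look excellent on an imbalanced dataset even when the classifier never finds the rare class.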

Programming

  • Program in R and Python (and Julia?)
  • Write shell scripts (Joel Grus)
  • Architect your software
  • Write automated test cases

Data management

  • Write a SQL query (Joel Grus)
  • Solve complex data aggregation problems
  • Pivot a dataframe
  • Clean up messy data (Joel Grus)
  • Optimize SQL queries and understand physical query plans
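
A simple aggregation query can be sketched with the `sqlite3` module from the Python standard library. The `sales` table, its columns and the function name `total_by_region` are all made up for this example.

```python
import sqlite3


def total_by_region(rows):
    """Sum the amounts per region using a GROUP BY query on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

For example, `total_by_region([("north", 10.0), ("south", 5.0), ("north", 2.5)])` returns one `(region, total)` tuple per region, sorted by region name.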

Large volumes of data

  • Think in MapReduce (Joel Grus)
  • Set up a Hadoop cluster
  • Write massively parallel algorithms
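
"Thinking in MapReduce" means expressing a computation as a map step that emits key-value pairs, a shuffle step that groups values by key, and a reduce step that combines each group. The sketch below runs the classic word count in a single process to show the shape of the idea; it is not how Hadoop itself is programmed, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a real framework can spread those calls across many machines.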

Domain expertise

  • Understand a business challenge and define data-driven ways to solve it (@totopampin)
  • Measure the business impact of your work
  • Use data visualization to empower and inspire stakeholders

Data extraction

  • Scrape a website (Joel Grus)
  • Retrieve data from any API
  • Measure the difference between two sequences
  • Integrate data from different sources
  • Process natural language
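
One common way to measure the difference between two sequences is the Levenshtein edit distance: the minimum number of insertions, deletions and substitutions needed to turn one sequence into the other. The sketch below is a minimal dynamic-programming version; other measures (Hamming distance, longest common subsequence) may fit better depending on the data, and the function name `edit_distance` is chosen just for this example.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between two sequences."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # Turning a[:i] into an empty sequence takes i deletions.
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]
```

It works on strings as well as lists, so `edit_distance("kitten", "sitting")` gives 3, and the same function can compare two token sequences.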

Please leave your suggestions in the comment section below!