Last weekend I shared the following slide from Joel Grus (author of the book Data Science from Scratch):
What data scientists do? by @joelgrus pic.twitter.com/XydGa2yVax
— José Luis López Pino (@jllopezpino) February 20, 2016
Joel was giving his own version of a famous quote by Robert A. Heinlein, highlighting, with humour, the wide range of tasks that data scientists perform. Because behind every joke there’s always some truth, I decided to compile this list of tasks regularly performed by data scientists.
Statistics and algebra
- Test a hypothesis (Joel Grus)
- Update a prior (Joel Grus)
- Measure the statistical dependence between two variables.
- Factor matrices (Joel Grus)
- Explain and interpret p-values
- Apply regression analysis.
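To make "update a prior" a bit more concrete, here is a minimal sketch of a conjugate Beta-Binomial update in plain Python. The function names (`update_beta_prior`, `beta_mean`) and the example counts are made up for illustration; this is only one of many ways to update a prior, not something taken from Joel's slide.

```python
def update_beta_prior(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, observing
    `successes` and `failures` gives a Beta(alpha + successes, beta + failures)
    posterior.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: uniform prior, then 7 successes and 3 failures.
prior = (1.0, 1.0)
posterior = update_beta_prior(*prior, successes=7, failures=3)

print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

The posterior mean sits between the prior mean (0.5) and the observed success rate (0.7), which is the usual intuition behind updating a prior with data.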
Machine learning
- Understand the differences among the various classification and clustering methods.
- Implement a machine learning method vaguely described in a paper.
- Run a regression (Joel Grus)
- Pretend to understand deep learning (Joel Grus)
- Evaluate the performance of machine learning algorithms and use different error metrics
- Design an experiment (Joel Grus)
- Build a recommendation system.
- Extract knowledge with stream mining algorithms from data that you can’t afford to process more than once.
- Engineer features from any existing dataset.
Programming
- Program in R and Python (and Julia?).
- Write shell scripts (Joel Grus)
- Architect your software
- Write automated test cases
Data management
- Write a SQL query (Joel Grus)
- Solve complex data aggregation problems.
- Pivot a dataframe
- Clean up messy data (Joel Grus)
- Optimize SQL queries and understand physical query plans.
Large volumes of data
- Think in MapReduce (Joel Grus)
- Set up a Hadoop cluster
- Write massively parallel algorithms
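As a rough illustration of what thinking in MapReduce means, the classic word count can be sketched in plain Python as three steps: map each document to key-value pairs, group the pairs by key, and reduce each group. The helper names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are invented for this example, and everything runs in a single process; a real cluster would spread the same steps across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to get the final count per word."""
    return {key: reduce(lambda x, y: x + y, values) for key, values in groups.items()}


# Hypothetical input: each string stands in for one document.
documents = ["the cat sat", "the dog sat down"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))

print(counts)
```

The point of the exercise is that the map and reduce functions never share state, which is what lets a framework run them in parallel.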
Domain expertise
- Understand a business challenge and define data-related ways to solve it (@totopampin)
- Measure the business impact of your work
- Use data visualization to empower and inspire stakeholders
Data extraction
- Scrape a web site (Joel Grus)
- Retrieve data from any API
- Measure the difference between two sequences
- Integrate data from different sources
- Process natural language
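One common way to measure the difference between two sequences is the Levenshtein (edit) distance. The sketch below is a plain-Python version using the standard dynamic programming approach; the function name `levenshtein` is my own choice, and other measures (Hamming distance, longest common subsequence, and so on) may fit a given problem better.

```python
def levenshtein(a, b):
    """Minimum number of single-item insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b`."""
    # previous[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning the first i items of a into an empty sequence
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein([1, 2, 3], [1, 3]))
```

Because it only compares items for equality, the same function works on strings, lists of tokens, or any other indexable sequence.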
Please leave your suggestions in the comment section below!