Data Science 101 – is there a “consensus curriculum”?

Dr. Aimee Schwab-McCoy

Our previous posts in the Data Science Curriculum series have covered computing competencies and program outlines for data science majors. In our final post, we’ll explore the introduction to data science course.

Data science is still a young discipline

The earliest data science courses are only about 10 years old, which means that introductory data science courses vary widely from school to school. Unlike first programming or statistics courses, a true consensus curriculum for data science has yet to emerge. However, some topics are widely taught at the introductory level.

In Fall 2019, researchers at Creighton University sent a survey to mathematics, statistics, and computer science faculty asking them to indicate which of 34 topics were covered in their intro data science course. Sixty-eight faculty responded and completed the topic ranking.

The most common topics listed were:

Topic                                                Proportion of courses
Exploratory data analysis                            82%
Data cleaning and wrangling                          75%
Data ethics and responsible data use                 63%
Data curation and data quality                       53%
Linear and logistic regression                       53%
Reproducible research                                51%
Data lifecycle and data collection                   50%
Research methods                                     41%
Data architecture, data types, and data formats      40%
Text mining                                          40%
Customizing data visualizations                      40%
Supervised machine learning                          38%

Data exploration, data wrangling, and data ethics were the three most common topics in the intro data science course

Basic models such as linear and logistic regression, along with the data lifecycle and data types, were also commonly covered. Supervised machine learning algorithms and applications like text mining and custom visualizations rounded out the most common topics.

Some topics were frequently reported as covered elsewhere in the data science curriculum, but not necessarily in the introductory course, including:

  • Linear algebra: matrix manipulation, eigenvalues, singularity (74% covered elsewhere)
  • Traditional statistical inference: hypothesis tests, confidence intervals (66%)
  • Relational and non-relational databases (59%)
  • Experimental design, modeling, and planning (57%)
  • Simulation-based inference: bootstrapping, randomization tests (53%)
  • Optimization and numerical algorithms (53%)
  • Systems engineering and software engineering principles (51%)
  • Unsupervised machine learning (47%)
  • Big data technologies: batch and parallel processing (46%)
  • Supervised machine learning (41%)
  • Cloud computing (41%)

Data science courses have continued to evolve since 2019, so some topics may be more or less important in your course. The Data Science Foundations zyBook covers all of the essential data science topics and more, allowing you to customize your curriculum. Please visit the How to Teach Data Science – zyBooks Guide for additional resources and best practices.

For more information, check out the original study: Aimee Schwab-McCoy, Catherine M. Baker & Rebecca E. Gasper (2021) Data Science in 2020: Computing, Curricula, and Challenges for the Next 10 Years, Journal of Statistics and Data Science Education, 29:1, S40-S50, DOI: 10.1080/10691898.2020.1851159

Author Bio

Dr. Aimee Schwab-McCoy

Before joining zyBooks, Aimee was a statistics professor at Creighton University, where she created a Data Science program. Aimee is an experienced statistics and data science education researcher and passionate about developing engaging resources for data science learners.