Building Strong Data Wrangling Skills from the Ground Up
A practical guide for data science instructors on teaching concepts before code. Watch the recording.
The Reality Check: Data Wrangling Dominates Real-World Work
Here’s a sobering statistic that’s been true for over a decade: data scientists spend between 50% and 80% of their time collecting and preparing data before they can even begin analysis. Yet in many of our courses, data wrangling gets treated as an afterthought—a quick module before we dive into the “exciting” stuff like machine learning or visualization.
The truth is, our students won’t receive tidy, classroom-ready datasets in their careers. They’ll be lucky to get clean data at any point in their professional lives. So we need to prepare them not just to tolerate data wrangling, but to master it.
The Teaching Dilemma: Two Paths Forward
When it comes to teaching data wrangling, instructors typically fall into one of two camps:
Option 1: Jump Right In Pick a language (Python, R, Excel, whatever), open up the IDE, and start coding. Show students how to use pivot_table() or group_by() and let them figure it out through practice.
Option 2: Concepts First, Code Second Build conceptual understanding through visualizations and animations, then introduce the programming syntax once students grasp what’s actually happening to their data.

In our recent webinar with data science instructors from across the country—representing institutions from Miami Dade College to Cornell to BYU—we found educators split almost evenly between these approaches. One instructor noted that “most textbooks support option 1,” which certainly matches the landscape when many of us first started teaching data science.
Why Concepts-First Wins (Even If Students Get a Little Bored)
Let me share a common scenario from Option 1 teaching:
A student looks at this Python code:

They have questions. What is a pivot table? What does aggfunc=np.size actually do? But here’s the problem—they might not ask those questions. They’ll just copy the syntax, hope it works, and move on without truly understanding the underlying operation.
Compare this to the concepts-first approach: students first see an animation showing how a pivot table organizes data—one variable on rows, another on columns, counts in the cells. They watch observations being added up. They understand the mechanism before they see the code.
When you finally show them pd.pivot_table(), suddenly it clicks: “Oh, that’s step 3 from the animation!” The aggfunc parameter makes sense because they’ve already seen the aggregation happen visually.
The bottom line: Would you rather trade some early-course boredom for deeper conceptual foundations? Students who understand why they’re filtering, grouping, or aggregating data become better programmers AND better data communicators.
The Hidden Benefits: Transferability and Communication
The concepts-first approach delivers two critical advantages:
1. Skills Transfer Across Tools
When students deeply understand grouping as a concept, they can apply it whether they’re using pandas’ groupby(), R’s group_by(), or SQL’s GROUP BY. They’re not locked into memorizing one library’s syntax—they understand the underlying operation.
As one webinar attendee noted, they teach students to “assess student knowledge first before choosing options.” This adaptive approach recognizes that different student populations need different entry points.
2. Communication Skills
Data scientists must explain their work to stakeholders and colleagues who don’t code. If a student can articulate why they’re taking a particular data wrangling approach—not just how—they’re developing the communication skills that separate good data scientists from great ones.
Practical Tools for Teaching Data Wrangling
Regardless of which philosophy you lean toward, here are concrete strategies that work:
Interactive Animations
Replace static diagrams with step-by-step animations showing how data transformations actually work. Students can click through at their own pace, seeing exactly where each value comes from and how calculations occur. These should include alt text and descriptions for accessibility.
Embedded Jupyter Notebooks
Give students a friction-free environment to practice. If you’ve ever tried to help students install Python or R asynchronously, you know the pain. Embedded notebooks eliminate that early barrier—students can experiment with code immediately while building comfort with the tools.
One key feature: make notebooks fully editable. Let students change aggregation functions, modify filters, or try different approaches. They should be able to download their work as a zip file for local use later.
Scaffolded Challenge Activities
Move beyond simple multiple-choice questions. Use randomized, multi-level activities where students must complete Level 1 successfully before advancing to Level 2. This mastery-based approach ensures students actually understand concepts rather than just getting lucky.
Meaningful Feedback
Provide meaningful feedback for both correct and incorrect answers for all assessment types. When students choose the wrong option, explain the misconception that led there. This turns every question into a learning opportunity.
Labs with Hidden Test Cases
Create larger programming assignments (zyLabs) that assess multiple skills with multiple test cases. Critically, include hidden datasets that students can’t see—this helps prevent both cheating and over-reliance on AI tools that students might be tempted to use.
Track student behavior: how many submissions, how much code was pasted versus typed, time spent on the assignment. These metrics can flag unusual patterns worth investigating.
Addressing the AI Elephant in the Room
One webinar attendee asked directly: “Is AI an issue?”
Yes and no. AI tools are here to stay, so the question becomes: how do we teach in an AI-augmented world?
Strategies that help:
- Use hidden test cases in programming assignments
- Focus assessments on explaining code and data decisions, not just producing working code
- Build conceptual understanding so deeply that students can spot when AI-generated code doesn’t actually solve their specific problem
- Teach students to use AI as a learning tool (like asking it to explain error messages) rather than a shortcut
- Use pair programming or in-class activities to practice in a live environment
Quick Wins You Can Implement Today
Based on feedback from the webinar attendees—who cited challenges like “not having enough materials” and needing “real-world samples and simulations”—here are immediate actions:
- Assign even small amounts of interactive reading for credit. Research shows that making just 5% of course grades dependent on completing interactive activities dramatically reduces DFW rates.
- Start one lesson with an animation instead of code. Pick your most confusing function and visualize what it does before showing syntax.
- Add one “explain your reasoning” question to assessments. Force students to articulate why they chose a particular data wrangling approach.
- Create one custom lab using AI assistance as a starting point. New tools can generate starter code, test cases, and instructions from natural language prompts—then you refine them.
- Use real, messy datasets. Students need practice with data that has missing values, inconsistent formatting, and unexpected quirks—just like they’ll encounter professionally.
The Bottom Line
Data wrangling isn’t just a prerequisite for “real” data science—it is real data science. According to a 2020 survey of data science instructors across math, statistics, and computer science departments, 3 in 4 considered data wrangling essential for intro courses. Those who didn’t include it in the intro course had dedicated entire subsequent courses to it.
Whether you teach with Python, R, or Excel, the fundamental goal remains the same: students who understand both what data wrangling operations do and why they matter will be better equipped for the messy reality of professional data science work.
So maybe it’s time to let students be a little bored with concepts before they get excited about code. The strong foundation you build will serve them throughout their careers.
Want to explore these teaching approaches further? The full webinar recording covers detailed demonstrations of interactive animations, embedded Jupyter notebooks, challenge activities, and custom lab creation—all with real examples from data science foundations courses.
Key Takeaway: Students who have a deep understanding of data wrangling concepts can transfer those skills across tools, communicate their reasoning effectively, and adapt to new technologies throughout their careers. That’s worth a few minutes of early-course boredom.
