🧹 The Art of Data Cleaning

🎯 Learning Objectives
  • Combine multiple datasets using merge and concat
  • Perform powerful aggregations with groupby
  • Apply strategies for handling missing data (`NaN`)
  • Clean messy text data using vectorized string methods
  • Chain methods for readable data pipelines
📚 Key Vocabulary
Merge Combining two DataFrames based on common column(s) (like SQL JOIN). Types: inner, left, right, outer.
Concat Stacking DataFrames vertically (rows) or horizontally (columns) without matching keys. Useful for combining similar datasets.
GroupBy Split-Apply-Combine pattern: split data into groups by column(s), apply aggregation functions, combine results into summary DataFrame.
NaN "Not a Number"—Pandas representation of missing values. Use isna(), fillna(), dropna() to handle them.
Vectorized Operations Operations applied to entire columns at once (not loops). Fast because Pandas uses optimized C/NumPy code under the hood.
Method Chaining Calling multiple DataFrame methods in sequence (e.g., df.dropna().groupby('col').mean()). Creates readable pipelines.

🔗 Merging & Joining

Combine DataFrames using merge() and concat(). Essential for integrating data from multiple sources (like SQL joins).

📊 GroupBy Aggregations

Use groupby() to split data by categories, apply functions (sum, mean, count), and create summary statistics.

🧹 Handling Missing Data

Detect and handle NaN values with isna(), fillna(), and dropna(). Critical for clean datasets.

🔤 String Operations

Clean text data with vectorized .str methods—split, strip, replace, extract patterns—no loops needed.

🎯 Analogy: Pandas Advanced = Kitchen Tools

Merge: Like combining ingredients from two separate recipe cards (left card + right card) based on matching ingredient names.

Concat: Stacking two trays of cookies on top of each other—simple vertical/horizontal combination without matching.

GroupBy: Sorting produce into bins (apples, oranges, bananas), then weighing each bin to get total weight per fruit type.

NaN (Missing Data): Empty slots in your ingredient list. You can fill them (buy more), drop them (skip recipe), or leave them empty.

String Methods: A label maker that cleans messy handwritten labels—trim spaces, capitalize, fix typos—automatically on all jars at once.

Method Chaining: A conveyor belt where each station performs one task (wash → peel → chop → cook) in sequence.
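In code, the conveyor belt becomes a chain of methods, each feeding its result to the next. A minimal sketch with invented numbers:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'dept': ['HR', 'Eng', 'HR', 'Eng'],
    'salary': [60000, np.nan, 62000, 90000],
})

# wash -> peel -> chop -> cook: each station does one job
summary = (
    df.dropna()            # wash: remove rows with missing values
      .groupby('dept')     # peel: split into groups
      .mean()              # chop: aggregate each group
      .reset_index()       # cook: back to a flat table
)
print(summary)
```

Wrapping the chain in parentheses lets you put each step on its own line, which keeps long pipelines readable.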

💡 80/20 Rule

Data scientists spend roughly 80% of their time cleaning and preparing data, and only 20% analyzing it. Mastering these tools makes you efficient.

Real-World Applications:

  • E-commerce: Merge customer data with order data to analyze purchasing behavior per demographic
  • Finance: GroupBy transactions by account to calculate monthly spending per category
  • Healthcare: Fill missing patient vitals using forward-fill or mean imputation strategies
  • Marketing: Clean messy email addresses (trim whitespace, lowercase) before deduplication
💡 Learning Strategy: Advanced Pandas

Master GroupBy First: It's the most powerful and confusing concept. Practice split-apply-combine until it clicks.

Use Method Chaining: Chain operations for readable pipelines: df.dropna().groupby('col').mean() tells a data story.

Understand Join Types: Draw Venn diagrams for inner, left, right, outer joins. Visualize what rows survive each merge type.

Don't Loop—Vectorize: If you're using for row in df.iterrows(), there's almost always a vectorized alternative. Ask yourself "How can Pandas do this in one line?"

Test on Small Data First: Before running expensive operations (merge, groupby), test logic on df.head(100) to catch bugs fast.

🔗 Merging & Joining

Combining data from different sources is a fundamental skill.

Python ❌ Bad Practice Manual Loops
# ❌ BAD: Looping to match data
users = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
depts = [{'id': 1, 'dept': 'HR'}, {'id': 2, 'dept': 'Engineering'}]

result = {}
for u in users:
    for d in depts:
        if u['id'] == d['id']:
            result[u['name']] = d['dept']

print(result)
# Very slow O(N*M) complexity!
Python 🔰 Novice Basic Merge
# 🔰 NOVICE: Simple inner merge
import pandas as pd

users = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
depts = pd.DataFrame({'id': [1, 2], 'dept': ['HR', 'Engineering']})

# Merge on 'id' column
merged = pd.merge(users, depts, on='id')
print(merged)
Python ⭐ Best Practice Explicit Joins
# ⭐ BEST PRACTICE: Controlling join type
import pandas as pd

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
depts = pd.DataFrame({'id': [1, 2], 'dept': ['HR', 'Engineering']})

# 'how' parameter controls behavior:
# 'inner' (default): Intersection
# 'left': Keep all users, fill missing depts with NaN
# 'outer': Union of both
merged = pd.merge(users, depts, on='id', how='left')

print(merged)
# Note: Carol has NaN department because id=3 isn't in depts
🏋️ Exercise: Customer Orders

Scenario: You have three DataFrames: customers, orders, and products. Merge them to create a complete order report.

  • Create a customers DataFrame with columns: customer_id, name, city
  • Create an orders DataFrame with columns: order_id, customer_id, product_id, quantity
  • Create a products DataFrame with columns: product_id, product_name, price
  • Merge all three DataFrames to show: customer name, product name, quantity, and total price
  • Use a left join to keep all orders, even if product info is missing
  • Calculate total revenue per customer using groupby after merging

Expected Output: A DataFrame showing each order with customer details, product details, and calculated total (quantity × price).
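One way the steps above might come together (all names and prices invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2], 'name': ['Ana', 'Ben'], 'city': ['Oslo', 'Lima']})
orders = pd.DataFrame({
    'order_id': [10, 11, 12], 'customer_id': [1, 1, 2],
    'product_id': [100, 101, 100], 'quantity': [2, 1, 3]})
products = pd.DataFrame({
    'product_id': [100, 101], 'product_name': ['Pen', 'Pad'],
    'price': [2.0, 5.0]})

# Left joins keep every order even if lookup data is missing
report = (
    orders.merge(customers, on='customer_id', how='left')
          .merge(products, on='product_id', how='left')
)
report['total'] = report['quantity'] * report['price']

# Total revenue per customer via groupby
revenue = report.groupby('name')['total'].sum().reset_index()
print(revenue)
```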

⚠️ Common Pitfalls: Merging
  • Forgetting Join Type: Default merge() is inner join, which drops unmatched rows. Explicitly set how='left' or how='outer' to avoid silent data loss.
  • Duplicate Keys: If the key column has duplicates in both DataFrames, the result explodes (Cartesian product). Check for duplicates before merging.
  • Column Name Collisions: Merging creates _x and _y suffixes for duplicate column names. Use suffixes=('_left', '_right') for clarity.
  • Ignoring Index: concat() preserves indexes by default, causing duplicate index values. Set ignore_index=True to reset to 0, 1, 2...
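Two of these pitfalls are easy to see with toy data: duplicate keys multiply rows, and concat() keeps old indexes unless told otherwise.

```python
import pandas as pd

# Duplicate keys: 2 matching rows x 2 matching rows = 4 result rows
left = pd.DataFrame({'key': [1, 1], 'a': ['x', 'y']})
right = pd.DataFrame({'key': [1, 1], 'b': ['p', 'q']})
exploded = pd.merge(left, right, on='key')
print(len(exploded))  # Cartesian product per key

# concat keeps original indexes by default
stacked = pd.concat([left, left])
print(stacked.index.tolist())   # duplicate index values
reset = pd.concat([left, left], ignore_index=True)
print(reset.index.tolist())     # clean 0..n-1 index
```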
💬 Discussion Questions
  1. What's the difference between merge() and join()? When would you use each?
  2. How does an outer join differ from a union in SQL? What does Pandas do with unmatched rows?
  3. Why is it important to validate data before merging (e.g., checking for duplicate keys)?
  4. When would you use concat() instead of merge()? Give examples of scenarios for each.

📦 Grouping & Aggregation

Split data into groups, apply a function, and combine results (Split-Apply-Combine).

Python ⭐ Powerful Aggregation
import pandas as pd

df = pd.DataFrame({
    'dept': ['HR', 'Eng', 'Eng', 'HR', 'Eng'],
    'salary': [60000, 80000, 90000, 60000, 85000],
    'age': [25, 30, 35, 22, 28]
})

# Group by Department and calculate multiple stats
summary = df.groupby('dept').agg({
    'salary': ['mean', 'max']
})

print(summary)
🏋️ Exercise: Sales By Region

Scenario: Analyze regional sales data using GroupBy with multiple aggregations.

  • Create a sales DataFrame with columns: region, salesperson, product, units_sold, revenue
  • Group by region and calculate: total revenue, average units sold, count of transactions
  • Group by region AND product to find best-selling product per region
  • Use .agg() with a dictionary to apply different functions to different columns
  • Find the top salesperson in each region using idxmax() or sorting
  • Reset index and rename columns for a clean summary table

Bonus: Use .transform() to add a column showing each row's percentage of regional total revenue.
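A sketch of the core moves in this exercise, using a tiny made-up sales table (named aggregation for per-column functions, then .transform() for the bonus):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'salesperson': ['Ana', 'Ben', 'Cy', 'Di'],
    'product': ['A', 'B', 'A', 'A'],
    'units_sold': [10, 5, 8, 12],
    'revenue': [100.0, 50.0, 80.0, 120.0],
})

# Named aggregation: different functions per column, clean column names
summary = (
    sales.groupby('region')
         .agg(total_revenue=('revenue', 'sum'),
              avg_units=('units_sold', 'mean'),
              transactions=('revenue', 'count'))
         .reset_index()
)
print(summary)

# transform() returns same-shape output: ideal for per-row ratios
sales['pct_of_region'] = (
    sales['revenue'] / sales.groupby('region')['revenue'].transform('sum')
)
print(sales[['region', 'salesperson', 'pct_of_region']])
```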

🚀 Challenge: Excel-Style Reports (Pivot Tables)

Scenario: Create Excel-style pivot table reports for business analysis.

  • Create a sales DataFrame with: date, region, category, product, revenue
  • Use pd.pivot_table() with region as rows and category as columns
  • Add margins=True to show row and column totals (like Excel)
  • Create a multi-index pivot table with region and category as rows
  • Use aggfunc parameter to show both sum and mean in the same pivot table
  • Use fill_value=0 to replace NaN with zeros in empty cells

Advanced: Create a crosstab showing count of products sold by region and day of week.
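A minimal pivot-table sketch showing the margins=True and fill_value=0 pieces of the challenge (data invented):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'category': ['Food', 'Toys', 'Food', 'Food'],
    'revenue': [10.0, 20.0, 30.0, 40.0],
})

# Rows = region, columns = category, totals in the 'All' row/column
pivot = pd.pivot_table(
    sales,
    index='region', columns='category', values='revenue',
    aggfunc='sum', fill_value=0, margins=True,
)
print(pivot)
```

Note how region S never sold Toys: without fill_value=0 that cell would be NaN.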

⚠️ Common Pitfalls: GroupBy
  • Forgetting to Aggregate: df.groupby('col') returns a GroupBy object, not a DataFrame. You must call .mean(), .sum(), etc., to get results.
  • Lost Columns: Aggregation functions work on numeric columns. Older Pandas versions silently dropped non-numeric columns; Pandas 2.0+ raises an error instead, so select the columns you need or pass numeric_only=True.
  • Confusing agg() Syntax: .agg({'col': ['mean', 'max']}) creates multi-level columns. Use .agg({'col': 'mean'}) for single-level or .reset_index() to flatten.
  • Index Confusion: GroupBy results have the grouped column(s) as the index. Use .reset_index() to move them back to regular columns.
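The first and last pitfalls in one tiny demo (toy data): groupby() alone is not a DataFrame, and the grouped column ends up in the index until you reset it.

```python
import pandas as pd

df = pd.DataFrame({'dept': ['HR', 'Eng', 'HR'], 'salary': [60, 80, 70]})

g = df.groupby('dept')
print(type(g).__name__)      # a GroupBy object, not a DataFrame

means = g['salary'].mean()   # 'dept' is the index here
flat = means.reset_index()   # 'dept' back to a regular column
print(flat)
```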
💬 Discussion Questions
  1. What does "Split-Apply-Combine" mean in the context of GroupBy? Give a real-world example.
  2. How does df.groupby('col').mean() differ from df['col'].mean()?
  3. When would you use .agg() with a dictionary vs. a single function? What are the trade-offs?
  4. Why does GroupBy create a grouped index? When would you want to keep it vs. reset it?

❓ Handling Missing Data

Real data is messy. Pandas uses NaN (Not a Number) to represent missing values.

Python 🔰 Novice Drop NA
# 🔰 NOVICE: Just deleting missing rows
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan]
})

# Drops ANY row containing NaN
clean = df.dropna()
print(clean)
# Result: Only row 0 remains. We lost data!
Python ⭐ Best Practice Smart Filling
# ⭐ BEST PRACTICE: Imputation strategies
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan]
})

# Fill A with mean of A
df['A'] = df['A'].fillna(df['A'].mean())

# Fill B with specific value (e.g., 0)
df['B'] = df['B'].fillna(0)

print(df)
🏋️ Exercise: Stock Data Analysis

Scenario: Work with stock price data that has missing values and datetime index.

  • Create a stock DataFrame with date, symbol, open, high, low, close, volume
  • Convert the date column to datetime and set it as the index
  • Introduce some NaN values to simulate missing trading days
  • Use ffill() (forward fill) to fill missing prices with the last known value
  • Resample to weekly data using .resample('W').agg({'close': 'last', 'volume': 'sum'})
  • Calculate 7-day rolling average using .rolling(7).mean()
  • Handle volume NaN differently (fill with 0 since no trades occurred)

Bonus: Use interpolate(method='linear') for smoother missing value estimation.
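A compressed sketch of the exercise's core steps (invented prices, 10 days instead of a full history, 3-day rolling window instead of 7):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('2024-01-01', periods=10, freq='D')
df = pd.DataFrame({
    'close': [100, 101, np.nan, 103, np.nan, 105, 106, 107, 108, 109],
    'volume': [10, 12, np.nan, 9, np.nan, 11, 8, 7, 14, 10],
}, index=dates)

# Prices: carry the last known value forward
df['close'] = df['close'].ffill()
# Volume: missing means no trades, so 0 is the right fill
df['volume'] = df['volume'].fillna(0)

# Weekly summary: last close of the week, total volume
weekly = df.resample('W').agg({'close': 'last', 'volume': 'sum'})
print(weekly)

# Rolling average smooths day-to-day noise
df['rolling_close'] = df['close'].rolling(3).mean()
```

The key idea is that different columns deserve different fill strategies, chosen from what the data means.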

⚠️ Common Pitfalls: Missing Data
  • Not Checking for NaN: Forgetting to check df.isna().sum() before analysis causes silent errors (NaN propagates through calculations).
  • Dropping Too Much Data: dropna() without parameters drops any row with any NaN. Use subset=['col'] or thresh=n to be selective.
  • Blind Imputation: Filling NaN with mean/median/mode without understanding why data is missing can introduce bias. Always investigate the cause first.
  • Overwriting Original Data: fillna() and dropna() return new objects by default. Assign the result back: df = df.dropna(). Prefer reassignment over inplace=True, which is discouraged in modern Pandas.
💬 Discussion Questions
  1. What are the three main strategies for handling missing data (drop, fill, predict)? When is each appropriate?
  2. Why is filling NaN with the mean sometimes a bad idea? What assumptions does it make about your data?
  3. How does forward-fill (ffill()) differ from backward-fill (bfill())? When would you use each?
  4. What ethical considerations arise when imputing missing data in healthcare or financial datasets?

🧵 String Operations

Pandas has a special .str accessor for vectorized string manipulation.

Python ⭐ Vectorized Strings
import pandas as pd

df = pd.DataFrame({
    'name': [' Alice ', 'Bob', 'charlie'],
    'email': ['alice@test.com', 'bob@gmail.com', 'char@test.com']
})

# Chain operations: strip whitespace -> upper case
df['name'] = df['name'].str.strip().str.upper()

# Filter by string contains
gmail_users = df[df['email'].str.contains('gmail')]

print(df['name'])
🏋️ Exercise: Custom Transformations

Scenario: Use apply() and transform() for complex row and column operations.

  • Create a DataFrame with employee data: name, department, salary, hire_date, performance_score
  • Use apply() on a row to calculate a custom bonus: salary * 0.1 if performance > 8, else salary * 0.05
  • Use apply() on a column to categorize salaries: 'Low', 'Medium', 'High' based on thresholds
  • Use transform() to calculate each employee's salary as percentage of department total
  • Create a custom function that processes multiple columns and returns a new calculated field
  • Compare apply(axis=0) (column-wise) vs apply(axis=1) (row-wise) behavior

Key Insight: transform() returns same-shape output (for adding columns), while apply() can return any shape.
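The bonus rule and the department-percentage step might look like this (employee data invented):

```python
import pandas as pd

emp = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'department': ['HR', 'HR', 'Eng'],
    'salary': [50000, 60000, 90000],
    'performance_score': [9, 7, 8],
})

# Row-wise apply (axis=1): a rule that reads multiple columns per row
emp['bonus'] = emp.apply(
    lambda row: row['salary'] * 0.1 if row['performance_score'] > 8
    else row['salary'] * 0.05,
    axis=1,
)

# transform(): same shape as the input, so it aligns row by row
emp['pct_of_dept'] = (
    emp['salary'] / emp.groupby('department')['salary'].transform('sum')
)
print(emp)
```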

🚀 Challenge: Text Data Pipeline

Scenario: Build a complete text cleaning pipeline using method chaining.

  • Create a DataFrame with messy product descriptions containing: extra spaces, mixed case, special characters, HTML tags
  • Chain .str methods: strip() → lower() → replace() → extract()
  • Extract product codes using regex: .str.extract(r'(\w{3}-\d{4})')
  • Remove HTML tags using .str.replace(r'<.*?>', '', regex=True)
  • Split product names into separate columns (brand, model, variant)
  • Flag invalid entries where cleaning produced empty strings

Advanced: Use apply() with a custom function for complex cleaning logic that can't be done with .str alone.
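A small sketch of the tag-stripping and code-extraction steps, using two invented product descriptions:

```python
import pandas as pd

df = pd.DataFrame({'desc': [
    '  <b>ACME</b> Widget ABC-1234  ',
    '<i>Gadget</i> xyz-9999 deluxe',
]})

cleaned = (
    df['desc']
      .str.replace(r'<.*?>', '', regex=True)  # strip HTML tags
      .str.strip()                            # trim whitespace
      .str.lower()                            # normalize case
)
df['code'] = cleaned.str.extract(r'(\w{3}-\d{4})')  # pull product codes
df['clean'] = cleaned
print(df[['clean', 'code']])
```

Stripping tags before extraction matters: a code split by markup would otherwise escape the regex.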

⚠️ Common Pitfalls: String Operations
  • Using Loops Instead of .str: Never loop through rows for string operations. df['col'].str.lower() is far faster than an explicit Python loop and composes cleanly with other .str methods.
  • NaN Breaks String Methods: .str methods propagate NaN, which can poison later steps (e.g., boolean filtering with .str.contains()). Use df['col'].fillna('') first, or pass na=False to .str.contains().
  • Regex Confusion: .str.contains() uses regex by default. To search for literal special characters like . or $, set regex=False.
  • Not Chaining .str Methods: You can chain multiple operations: df['col'].str.strip().str.lower().str.replace(' ', '_'). Use it for readable pipelines.
💬 Discussion Questions
  1. Why are vectorized string operations (.str) faster than loops? What happens under the hood?
  2. When would you use .str.extract() vs. .str.split() for parsing structured text data?
  3. How can you use .str.contains() with regex to filter emails by domain (e.g., all gmail.com addresses)?
  4. What strategies can you use to clean messy address data (extra spaces, inconsistent capitalization, special characters)?
🎮 Practice Playground: Email Cleaner

Challenge: Given a DataFrame with a messy email column:

  • Remove leading/trailing whitespace
  • Convert all emails to lowercase
  • Extract domain names (everything after @) into a new column
  • Filter rows where domain is "gmail.com"
  • Flag invalid emails (no @ symbol) with a boolean column

Bonus: Use method chaining to do all operations in one line.
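One possible solution to the playground, with a few invented addresses:

```python
import pandas as pd

df = pd.DataFrame({'email': ['  Alice@Gmail.com ', 'bob@test.com',
                             'broken-email']})

# Normalize: trim whitespace, lowercase
df['email'] = df['email'].str.strip().str.lower()

# Flag invalid entries (no @ symbol)
df['valid'] = df['email'].str.contains('@')

# Domain = everything after @; blank out invalid rows
df['domain'] = df['email'].str.split('@').str[-1].where(df['valid'])

# Keep only gmail.com addresses
gmail = df[df['domain'] == 'gmail.com']
print(gmail)
```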

📜 Advanced Pandas Cheat Sheet

Patterns Reference
# ═══ MERGING ═══
pd.merge(df1, df2, on='id', how='inner')
pd.concat([df1, df2])  # Stack vertically

# ═══ GROUPBY ═══
df.groupby('col').mean()
df.groupby('col').agg({'val': ['min', 'max']})

# ═══ MISSING DATA ═══
df.isna().sum()        # Count missing
df.dropna()            # Drop rows
df.fillna(value)       # Fill missing

# ═══ STRINGS ═══
df['s'].str.lower()
df['s'].str.contains('pattern')
df['s'].str.replace('old', 'new')
🎯 Key Takeaways: Advanced Pandas
  • Master Merge Types: Understand inner, left, right, outer joins. Default is inner—explicitly set how= to avoid silent data loss.
  • GroupBy = Split-Apply-Combine: Split data by groups, apply aggregation functions, combine results. Essential for summary statistics.
  • Check for NaN First: Use df.isna().sum() before analysis. NaN values propagate through calculations and break visualizations.
  • Choose Imputation Wisely: Filling NaN with mean/median/mode can introduce bias. Understand why data is missing before filling.
  • Vectorize String Operations: Never loop for string cleaning. Use .str methods: they're far faster than Python loops and propagate NaN instead of crashing.
  • Method Chaining for Readability: Chain operations to create readable pipelines: df.dropna().groupby('col').mean().reset_index().
  • Reset Index After GroupBy: GroupBy results have grouped column(s) as index. Use .reset_index() to move them back to columns.
  • Test on Small Data First: Before expensive operations (merge, groupby), test logic on df.head(100) to catch bugs quickly.