🧹 The Art of Data Cleaning
- Combine multiple datasets using merge and concat
- Perform powerful aggregations with groupby
- Handle missing data (`NaN`) with isna(), fillna(), and dropna()
- Clean messy text data using vectorized string methods
- Chain methods for readable data pipelines (e.g. df.dropna().groupby('col').mean())
🔗 Merging & Joining
Combine DataFrames using merge() and concat(). Essential for integrating data from multiple sources (like SQL joins).
📊 GroupBy Aggregations
Use groupby() to split data by categories, apply functions (sum, mean, count), and create summary statistics.
🧹 Handling Missing Data
Detect and handle NaN values with isna(), fillna(), and dropna(). Critical for clean datasets.
🔤 String Operations
Clean text data with vectorized .str methods—split, strip, replace, extract patterns—no loops needed.
Merge: Like combining ingredients from two separate recipe cards (left card + right card) based on matching ingredient names.
Concat: Stacking two trays of cookies on top of each other—simple vertical/horizontal combination without matching.
GroupBy: Sorting produce into bins (apples, oranges, bananas), then weighing each bin to get total weight per fruit type.
NaN (Missing Data): Empty slots in your ingredient list. You can fill them (buy more), drop them (skip recipe), or leave them empty.
String Methods: A label maker that cleans messy handwritten labels—trim spaces, capitalize, fix typos—automatically on all jars at once.
Method Chaining: A conveyor belt where each station performs one task (wash → peel → chop → cook) in sequence.
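The conveyor-belt picture maps directly onto Pandas code. A minimal sketch, where the dept/salary columns are invented for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'dept': ['HR', 'Eng', 'HR', 'Eng'],
    'salary': [60000, np.nan, 62000, 90000],
})

# Each step is one "station": drop missing -> group -> average
avg_salary = df.dropna().groupby('dept')['salary'].mean()
print(avg_salary)
```

Each method returns a new object, so the next method in the chain picks up exactly where the last one left off.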
Data scientists famously report spending around 80% of their time cleaning and preparing data, and only 20% analyzing it. Mastering these tools makes you efficient.
Real-World Applications:
- E-commerce: Merge customer data with order data to analyze purchasing behavior per demographic
- Finance: GroupBy transactions by account to calculate monthly spending per category
- Healthcare: Fill missing patient vitals using forward-fill or mean imputation strategies
- Marketing: Clean messy email addresses (trim whitespace, lowercase) before deduplication
Master GroupBy First: It's the most powerful and confusing concept. Practice split-apply-combine until it clicks.
Use Method Chaining: Chain operations for readable pipelines: df.dropna().groupby('col').mean() tells a data story.
Understand Join Types: Draw Venn diagrams for inner, left, right, outer joins. Visualize what rows survive each merge type.
Don't Loop—Vectorize: If you're using for row in df.iterrows(), there's always a vectorized alternative. Ask yourself "How can Pandas do this in one line?"
Test on Small Data First: Before running expensive operations (merge, groupby), test logic on df.head(100) to catch bugs fast.
🔗 Merging & Joining
Combining data from different sources is a fundamental skill.
# ❌ BAD: Looping to match data
users = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
depts = [{'id': 1, 'dept': 'HR'}, {'id': 2, 'dept': 'Engineering'}]
result = {}
for u in users:
    for d in depts:
        if u['id'] == d['id']:
            result[u['name']] = d['dept']
print(result)
# Very slow O(N*M) complexity!
# 🔰 NOVICE: Simple inner merge
import pandas as pd
users = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
depts = pd.DataFrame({'id': [1, 2], 'dept': ['HR', 'Engineering']})
# Merge on 'id' column
merged = pd.merge(users, depts, on='id')
print(merged)
# ⭐ BEST PRACTICE: Controlling join type
import pandas as pd
users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
depts = pd.DataFrame({'id': [1, 2], 'dept': ['HR', 'Engineering']})
# 'how' parameter controls behavior:
# 'inner' (default): Intersection
# 'left': Keep all users, fill missing depts with NaN
# 'outer': Union of both
merged = pd.merge(users, depts, on='id', how='left')
print(merged)
# Note: Carol has NaN department because id=3 isn't in depts
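For completeness, concat() behaves quite differently from merge(): it stacks rows without matching keys. A small sketch with invented monthly DataFrames:

```python
import pandas as pd

jan = pd.DataFrame({'id': [1, 2], 'amount': [100, 200]})
feb = pd.DataFrame({'id': [3, 4], 'amount': [150, 250]})

# Stack vertically; ignore_index=True gives a fresh 0..n-1 index
sales = pd.concat([jan, feb], ignore_index=True)
print(sales)
# Without ignore_index=True the result would carry duplicate index labels 0 and 1
```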
Scenario: You have three DataFrames: customers, orders, and products. Merge them to create a complete order report.
- Create a customers DataFrame with columns: customer_id, name, city
- Create an orders DataFrame with columns: order_id, customer_id, product_id, quantity
- Create a products DataFrame with columns: product_id, product_name, price
- Merge all three DataFrames to show: customer name, product name, quantity, and total price
- Use a left join to keep all orders, even if product info is missing
- Calculate total revenue per customer using groupby after merging
Expected Output: A DataFrame showing each order with customer details, product details, and calculated total (quantity × price).
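One possible solution sketch, using made-up sample rows (the exact columns and data are up to you):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Ben'], 'city': ['Oslo', 'Lima']})
orders = pd.DataFrame({'order_id': [10, 11], 'customer_id': [1, 2],
                       'product_id': [100, 101], 'quantity': [2, 3]})
products = pd.DataFrame({'product_id': [100, 101], 'product_name': ['Pen', 'Pad'], 'price': [1.5, 4.0]})

# Left joins keep every order even if lookup info is missing
report = (orders
          .merge(customers, on='customer_id', how='left')
          .merge(products, on='product_id', how='left'))
report['total'] = report['quantity'] * report['price']

# Revenue per customer
revenue = report.groupby('name')['total'].sum()
print(report[['name', 'product_name', 'quantity', 'total']])
print(revenue)
```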
- Forgetting Join Type: Default merge() is an inner join, which drops unmatched rows. Explicitly set how='left' or how='outer' to avoid silent data loss.
- Duplicate Keys: If the key column has duplicates in both DataFrames, the result explodes (Cartesian product). Check for duplicates before merging.
- Column Name Collisions: Merging creates _x and _y suffixes for duplicate column names. Use suffixes=('_left', '_right') for clarity.
- Ignoring Index: concat() preserves indexes by default, causing duplicate index values. Set ignore_index=True to reset to 0, 1, 2...
- What's the difference between merge() and join()? When would you use each?
- How does an outer join differ from a union in SQL? What does Pandas do with unmatched rows?
- Why is it important to validate data before merging (e.g., checking for duplicate keys)?
- When would you use concat() instead of merge()? Give examples of scenarios for each.
📦 Grouping & Aggregation
Split data into groups, apply a function, and combine results (Split-Apply-Combine).
import pandas as pd
df = pd.DataFrame({
    'dept': ['HR', 'Eng', 'Eng', 'HR', 'Eng'],
    'salary': [60000, 80000, 90000, 60000, 85000],
    'age': [25, 30, 35, 22, 28]
})
# Group by department and calculate multiple stats
summary = df.groupby('dept').agg({
    'salary': ['mean', 'max']
})
print(summary)
Scenario: Analyze regional sales data using GroupBy with multiple aggregations.
- Create a sales DataFrame with columns: region, salesperson, product, units_sold, revenue
- Group by region and calculate: total revenue, average units sold, count of transactions
- Group by region AND product to find the best-selling product per region
- Use .agg() with a dictionary to apply different functions to different columns
- Find the top salesperson in each region using idxmax() or sorting
- Reset index and rename columns for a clean summary table
Bonus: Use .transform() to add a column showing each row's percentage of regional total revenue.
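The aggregation and transform() steps above can be sketched like this; named aggregation keeps the output columns flat, and all sample data is invented:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'salesperson': ['Ana', 'Bo', 'Cy', 'Di'],
    'units_sold': [10, 20, 5, 15],
    'revenue': [100.0, 300.0, 50.0, 250.0],
})

# Different functions per column via named aggregation
summary = (sales.groupby('region')
           .agg(total_revenue=('revenue', 'sum'),
                avg_units=('units_sold', 'mean'),
                transactions=('revenue', 'count'))
           .reset_index())
print(summary)

# transform() keeps the original shape: share of regional revenue per row
sales['pct_of_region'] = sales['revenue'] / sales.groupby('region')['revenue'].transform('sum')
print(sales)
```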
Scenario: Create Excel-style pivot table reports for business analysis.
- Create a sales DataFrame with: date, region, category, product, revenue
- Use pd.pivot_table() with region as rows and category as columns
- Add margins=True to show row and column totals (like Excel)
- Create a multi-index pivot table with region and category as rows
- Use the aggfunc parameter to show both sum and mean in the same pivot table
- Use fill_value=0 to replace NaN with zeros in empty cells
Advanced: Create a crosstab showing count of products sold by region and day of week.
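A minimal pivot_table() sketch covering rows, columns, totals, and NaN filling (sample data invented):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'category': ['Toys', 'Books', 'Toys', 'Toys'],
    'revenue': [100, 200, 50, 150],
})

# Regions as rows, categories as columns; margins adds Excel-style "All" totals
pivot = pd.pivot_table(sales, index='region', columns='category',
                       values='revenue', aggfunc='sum',
                       fill_value=0, margins=True)
print(pivot)
```

Note that South has no Books sales, so fill_value=0 replaces what would otherwise be NaN in that cell.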
- Forgetting to Aggregate: df.groupby('col') returns a GroupBy object, not a DataFrame. You must call .mean(), .sum(), etc., to get results.
- Lost Columns: Aggregation functions only work on numeric columns by default. Non-numeric columns are silently dropped unless explicitly handled.
- Confusing agg() Syntax: .agg({'col': ['mean', 'max']}) creates multi-level columns. Use .agg({'col': 'mean'}) for single-level columns or .reset_index() to flatten.
- Index Confusion: GroupBy results have the grouped column(s) as the index. Use .reset_index() to move them back to regular columns.
- What does "Split-Apply-Combine" mean in the context of GroupBy? Give a real-world example.
- How does df.groupby('col').mean() differ from df['col'].mean()?
- When would you use .agg() with a dictionary vs. a single function? What are the trade-offs?
- Why does GroupBy create a grouped index? When would you want to keep it vs. reset it?
❓ Handling Missing Data
Real data is messy. Pandas uses NaN (Not a Number) to represent missing values.
# 🔰 NOVICE: Just deleting missing rows
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan]
})
# Drops ANY row containing NaN
clean = df.dropna()
print(clean)
# Result: Only row 0 remains. We lost data!
# ⭐ BEST PRACTICE: Imputation strategies
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan]
})
# Fill A with mean of A
df['A'] = df['A'].fillna(df['A'].mean())
# Fill B with specific value (e.g., 0)
df['B'] = df['B'].fillna(0)
print(df)
Scenario: Work with stock price data that has missing values and datetime index.
- Create a stock DataFrame with date, symbol, open, high, low, close, volume
- Convert the date column to datetime and set it as the index
- Introduce some NaN values to simulate missing trading days
- Use ffill() (forward fill) to fill missing prices with the last known value
- Resample to weekly data using .resample('W').agg({'close': 'last', 'volume': 'sum'})
- Calculate a 7-day rolling average using .rolling(7).mean()
- Handle volume NaN differently (fill with 0 since no trades occurred)
Bonus: Use interpolate(method='linear') for smoother missing value estimation.
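The forward-fill, zero-fill, and resample steps above can be sketched on a tiny invented price series:

```python
import pandas as pd
import numpy as np

prices = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']),
    'close': [100.0, np.nan, 102.0, np.nan],
    'volume': [1000.0, np.nan, 1200.0, np.nan],
}).set_index('date')

# Prices: carry the last known value forward
prices['close'] = prices['close'].ffill()
# Volume: no trades occurred, so 0 is the honest fill
prices['volume'] = prices['volume'].fillna(0)

# Weekly summary: last close of the week, total volume
weekly = prices.resample('W').agg({'close': 'last', 'volume': 'sum'})
print(prices)
print(weekly)
```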
- Not Checking for NaN: Forgetting to check df.isna().sum() before analysis causes silent errors (NaN propagates through calculations).
- Dropping Too Much Data: dropna() without parameters drops any row with any NaN. Use subset=['col'] or thresh=n to be selective.
- Blind Imputation: Filling NaN with mean/median/mode without understanding why data is missing can introduce bias. Always investigate the cause first.
- Overwriting Original Data: fillna() and dropna() don't modify in place by default. Assign the result back: df = df.dropna(), or use inplace=True.
- What are the three main strategies for handling missing data (drop, fill, predict)? When is each appropriate?
- Why is filling NaN with the mean sometimes a bad idea? What assumptions does it make about your data?
- How does forward-fill (ffill()) differ from backward-fill (bfill())? When would you use each?
- What ethical considerations arise when imputing missing data in healthcare or financial datasets?
🧵 String Operations
Pandas has a special .str accessor for vectorized string manipulation.
import pandas as pd
df = pd.DataFrame({
    'name': [' Alice ', 'Bob', 'charlie'],
    'email': ['alice@test.com', 'bob@gmail.com', 'char@test.com']
})
# Chain operations: strip whitespace -> upper case
df['name'] = df['name'].str.strip().str.upper()
# Filter by string contains
gmail_users = df[df['email'].str.contains('gmail')]
print(df['name'])
Scenario: Use apply() and transform() for complex row and column operations.
- Create a DataFrame with employee data: name, department, salary, hire_date, performance_score
- Use apply() on each row to calculate a custom bonus: salary * 0.1 if performance > 8, else salary * 0.05
- Use apply() on a column to categorize salaries: 'Low', 'Medium', 'High' based on thresholds
- Use transform() to calculate each employee's salary as a percentage of department total
- Create a custom function that processes multiple columns and returns a new calculated field
- Compare apply(axis=0) (column-wise) vs apply(axis=1) (row-wise) behavior
Key Insight: transform() returns same-shape output (for adding columns), while apply() can return any shape.
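A sketch of the row-wise bonus and the transform() percentage from the steps above (sample data invented):

```python
import pandas as pd

emp = pd.DataFrame({
    'name': ['Ana', 'Bo', 'Cy'],
    'department': ['Eng', 'Eng', 'HR'],
    'salary': [80000, 90000, 60000],
    'performance_score': [9, 7, 8],
})

# Row-wise apply (axis=1): the bonus depends on two columns at once
emp['bonus'] = emp.apply(
    lambda row: row['salary'] * 0.1 if row['performance_score'] > 8 else row['salary'] * 0.05,
    axis=1)

# transform() returns one value per row, so it slots straight back into the frame
emp['pct_of_dept'] = emp['salary'] / emp.groupby('department')['salary'].transform('sum')
print(emp)
```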
Scenario: Build a complete text cleaning pipeline using method chaining.
- Create a DataFrame with messy product descriptions containing: extra spaces, mixed case, special characters, HTML tags
- Chain .str methods: strip() → lower() → replace() → extract()
- Extract product codes using regex: .str.extract(r'(\w{3}-\d{4})')
- Remove HTML tags using .str.replace(r'<.*?>', '', regex=True)
- Split product names into separate columns (brand, model, variant)
- Flag invalid entries where cleaning produced empty strings
Advanced: Use apply() with a custom function for complex cleaning logic that can't be done with .str alone.
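A condensed sketch of such a pipeline; the product-code format and sample rows here are made-up examples:

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['  <b>ABC-1234</b> Widget  ', 'XYZ-5678 GADGET', '<i>junk</i>']
})

# Strip tags, trim whitespace, normalize case, all chained in one pipeline
df['clean'] = (df['description']
               .str.replace(r'<.*?>', '', regex=True)
               .str.strip()
               .str.lower())

# Pull out product codes like abc-1234; expand=False returns a Series
df['code'] = df['clean'].str.extract(r'(\w{3}-\d{4})', expand=False)

# Flag rows where cleaning found no usable code
df['invalid'] = df['code'].isna()
print(df)
```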
- Using Loops Instead of .str: Never loop through rows for string operations. df['col'].str.lower() is much faster than apply(lambda x: x.lower()) and handles NaN for you.
- NaN Breaks String Methods: Most .str methods return NaN for missing values, so filtering with .str.contains() fails on a mask containing NA. Pass na=False or pre-fill with df['col'].fillna('').
- Regex Confusion: .str.contains() uses regex by default. To search for literal special characters like . or $, set regex=False.
- Not Chaining .str Methods: You can chain multiple operations: df['col'].str.strip().str.lower().str.replace(' ', '_'). Use it for readable pipelines.
- Why are vectorized string operations (.str) faster than loops? What happens under the hood?
- When would you use .str.extract() vs. .str.split() for parsing structured text data?
- How can you use .str.contains() with regex to filter emails by domain (e.g., all gmail.com addresses)?
- What strategies can you use to clean messy address data (extra spaces, inconsistent capitalization, special characters)?
Challenge: Given a DataFrame with a messy email column:
- Remove leading/trailing whitespace
- Convert all emails to lowercase
- Extract domain names (everything after @) into a new column
- Filter rows where domain is "gmail.com"
- Flag invalid emails (no @ symbol) with a boolean column
Bonus: Use method chaining to do all operations in one line.
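A hedged sketch of one way to solve the challenge, with invented sample emails:

```python
import pandas as pd

df = pd.DataFrame({'email': ['  Alice@Gmail.com ', 'bob@test.com', 'broken-at-nothing']})

# Trim and lowercase in one chained pass
df['email'] = df['email'].str.strip().str.lower()

# Domain is everything after the @; expand=False returns a Series
df['domain'] = df['email'].str.extract(r'@(.+)$', expand=False)

# Flag entries without an @ symbol
df['invalid'] = ~df['email'].str.contains('@')

gmail_users = df[df['domain'] == 'gmail.com']
print(df)
print(gmail_users)
```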
📜 Advanced Pandas Cheat Sheet
# ═══ MERGING ═══
pd.merge(df1, df2, on='id', how='inner')
pd.concat([df1, df2]) # Stack vertically
# ═══ GROUPBY ═══
df.groupby('col').mean()
df.groupby('col').agg({'val': ['min', 'max']})
# ═══ MISSING DATA ═══
df.isna().sum() # Count missing
df.dropna() # Drop rows
df.fillna(value) # Fill missing
# ═══ STRINGS ═══
df['s'].str.lower()
df['s'].str.contains('pattern')
df['s'].str.replace('old', 'new')
- Master Merge Types: Understand inner, left, right, and outer joins. The default is inner, so explicitly set how= to avoid silent data loss.
- GroupBy = Split-Apply-Combine: Split data by groups, apply aggregation functions, combine results. Essential for summary statistics.
- Check for NaN First: Use df.isna().sum() before analysis. NaN values propagate through calculations and break visualizations.
- Choose Imputation Wisely: Filling NaN with mean/median/mode can introduce bias. Understand why data is missing before filling.
- Vectorize String Operations: Never loop for string cleaning. Use .str methods; they're far faster and handle NaN gracefully.
- Method Chaining for Readability: Chain operations to create readable pipelines: df.dropna().groupby('col').mean().reset_index().
- Reset Index After GroupBy: GroupBy results have the grouped column(s) as the index. Use .reset_index() to move them back to columns.
- Test on Small Data First: Before expensive operations (merge, groupby), test logic on df.head(100) to catch bugs quickly.