👋 Welcome to Week 5: Data Wrangling & Aggregation
This week you'll dive deep into Pandas data manipulation. Building on your Week 4 foundation, you'll learn to select specific data subsets, aggregate data by groups, merge multiple datasets, and transform columns for analysis.
- Data Selection — Master .loc[], .iloc[], boolean filtering, and column selection
- Aggregation — Use groupby(), agg(), sum, mean, count to summarize data
- Merging — Combine DataFrames with merge() and different join types
- Transformation — Create and modify columns using .map() and operators
- Pivot Tables — Reorganize data for better analysis and presentation
📚 Key Terms to Know
🗺️ Today's Topics
Click any card below to jump directly to that section:
Series = Single Column: One column of data with a label for each row.
DataFrame = Entire Spreadsheet: Multiple columns of different types, all aligned by row.
Index = Row Labels: Defaults to 0, 1, 2..., but you can use dates, IDs, or any labels instead.
The Difference? Pandas handles millions of rows, does calculations instantly, and lets you write code to automate everything you'd click in Excel!
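The three terms above map directly onto code — a tiny sketch (the names `scores` and `df` are just for illustration):

```python
import pandas as pd

# A Series is one labeled column
scores = pd.Series([95, 88], index=['alice', 'bob'], name='score')

# A DataFrame is a table of such columns, all aligned by row
df = pd.DataFrame(
    {'score': [95, 88], 'grade': ['A', 'B']},
    index=['alice', 'bob'],   # the index: labels instead of just 0, 1, 2...
)

print(df.loc['alice', 'score'])  # 95 -- look up a cell by its labels
```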
Pandas is the foundation of data science in Python. Every data job you'll encounter uses it:
• Data Analysis — Clean, filter, aggregate, and visualize datasets
• Machine Learning — Prepare data for scikit-learn, TensorFlow, PyTorch
• Business Intelligence — Generate reports, dashboards, and insights
• ETL Pipelines — Extract data from APIs/databases, transform it, load to warehouses
• Financial Analysis — Time series, stock data, portfolio optimization
Each topic shows the same problem solved 4 different ways:
🔴 Bad Practice — Code that works but has problems (inefficient, error-prone, unprofessional)
🟠 Novice Solution — Basic working code that beginners write
🔵 Intermediate Solution — Better code with improvements
🟢 Best Practice — Professional, production-ready code
Focus on understanding why each level improves, not just memorizing syntax!
1. Think in columns, not rows: Pandas is optimized for column operations. Avoid looping through rows whenever possible.
2. Check types early: Use df.info() and df.dtypes immediately. Wrong types cause bugs and slow performance.
3. Start small: Test operations on df.head() before running on the full dataset.
4. Read error messages: Pandas errors are detailed. They often tell you exactly what's wrong.
5. Practice with real data: Download a CSV from Kaggle or data.gov and explore it using these techniques.
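Tip 1 in practice — the same calculation done row-by-row and column-wise, on a made-up column of prices:

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])

# Row-by-row loop (avoid): verbose, and slow on large data
taxed_loop = pd.Series([p * 1.08 for p in prices])

# Column-wise (prefer): one vectorized expression
taxed_vec = prices * 1.08

print(taxed_vec.equals(taxed_loop))  # True -- same result, far faster at scale
```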
🐼 Pandas Basics & Setup
Your data engineering journey begins with the right setup. How you import and initialize Pandas matters.
# ❌ BAD: Wildcard import or no alias
# Problem 1: Dumps hundreds of names from pandas into your namespace
# Problem 2: Silently shadows any matching name already in scope
# Problem 3: Makes code impossible to debug - where did DataFrame come from?
from pandas import *
# Now these are in your namespace, but you can't tell where they're from:
df = DataFrame({'a': [1, 2]}) # ⚠️ Is this pandas.DataFrame? Hard to tell!
# Worse, collisions are silent: an imported name that matches one of
# yours simply replaces it (NumPy's wildcard import even shadows sum()!)
Namespace pollution: the wildcard dumps hundreds of pandas names into your global scope, making name collisions likely.
Shadows existing names: wildcard imports silently replace matching names in scope — NumPy's wildcard import even overrides Python built-ins like sum(), any(), and all().
Unreadable: Other developers (and future you) can't tell where DataFrame came from without checking imports.
Tooling breaks: IDEs and linters can't provide accurate autocomplete or type checking.
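A small demo of the shadowing problem (requires NumPy, which pandas is built on — NumPy's wildcard export is the classic example of overriding built-ins):

```python
import builtins

from numpy import *  # noqa: F401,F403 -- the risky wildcard import

# 'sum' no longer refers to Python's built-in sum():
print(sum is builtins.sum)   # False -- it's now numpy's sum
print(sum([1, 2, 3]))        # 6 here, but the two functions behave differently
```

Same answer on a simple list, but numpy's `sum` has a different signature and error behavior — exactly the kind of silent swap that makes wildcard imports hard to debug.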
# 🔰 NOVICE: Importing without alias
# ✓ No namespace pollution
# ✓ Clear where DataFrame comes from
# ✗ Verbose - have to type 'pandas.' every time
import pandas
# Now it's explicit - DataFrame comes from pandas
df = pandas.DataFrame({'a': [1, 2]})
print(df)
# But typing 'pandas.' gets tedious in real projects...
series = pandas.Series([1, 2, 3]) # Too much typing!
Safe: No namespace pollution or name conflicts.
Explicit: pandas.DataFrame clearly shows where the class comes from.
But: Writing pandas. hundreds of times in a data analysis script is tedious and hurts readability.
# 📈 INTERMEDIATE: Standard 'pd' alias
# ✓ The universal convention in data science
# ✓ Every data scientist uses 'pd' - instantly recognizable
# ✓ Short enough to not be tedious
import pandas as pd
# Clean, concise, and standard
df = pd.DataFrame({'a': [1, 2]})
print(df)
# 'pd' is so standard, you'll see it in:
# - Every tutorial
# - Every Kaggle notebook
# - Every data science job
Universal standard: pd is recognized by every data scientist worldwide.
Concise: Short enough to not clutter code, but explicit enough to be clear.
Tool support: IDEs and linters are optimized for import pandas as pd.
# ⭐ BEST PRACTICE: Standard imports + version verification
# ✓ Standard aliases everyone recognizes
# ✓ Import both pandas and numpy (pandas is built on numpy)
# ✓ Verify versions for reproducibility
import pandas as pd
import numpy as np
# CRITICAL in notebooks: Print versions for reproducibility
# Different pandas versions can behave differently!
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")
# Optional: Set display options for better readability
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.precision', 2) # 2 decimal places
df = pd.DataFrame({'a': [1, 2]})
print(df)
Version checking: Pandas 2.x has breaking changes from 1.x. Always verify versions in production code.
Numpy included: Pandas is built on NumPy. You'll often need both, so import them together.
Display options: pd.set_option() makes DataFrames more readable in Jupyter notebooks.
Reproducibility: Anyone running your code knows exactly what versions you used.
Using non-standard aliases: Don't use import pandas as p or import pandas as panda. Always use pd.
Forgetting numpy: Most Pandas operations need NumPy. Import both at the start.
Not checking versions: Pandas 1.x code might break in 2.x. Always verify versions in production.
Importing specific functions: Don't do from pandas import DataFrame. Always use the pd. prefix for clarity.
1. Why is using a standard alias like pd important for collaboration? How would non-standard aliases hurt a team project?
2. When would you need to check the pandas version? Have you ever encountered code that broke due to version differences?
3. What other libraries have universally-recognized aliases? (Hint: numpy → np, matplotlib.pyplot → plt...)
4. Why does from pandas import * break things, but import pandas as pd doesn't?
Open notebook-sessions/week4/session1_data_engineering.ipynb and implement standard imports (pd, np), version checks, and display options. Add a cell explaining why version verification matters.
📊 Pandas Series: 1D Labeled Data
A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a column in a spreadsheet or a Python list with labels instead of just positions.
Series — A 1D array with labeled indices (like a single column)
index — The row labels (can be numbers, strings, dates, etc.)
values — The actual data as a NumPy array
dtype — The data type of the Series (int64, float64, object, etc.)
name — Optional label for the Series (becomes column name in DataFrame)
Python List: [1.2, 0.5, 3.0] — Access by position: prices[0]
Pandas Series: {'Apple': 1.2, 'Banana': 0.5, 'Cherry': 3.0} — Access by label: prices['Apple']
Why use Series? Labeled indices, automatic alignment, vectorized operations, and missing data handling!
# ❌ BAD: Using parallel lists for labeled data
products = ['Apple', 'Banana', 'Cherry']
prices = [1.20, 0.50, 3.00]
# Hard to maintain connection between product and price
idx = products.index('Apple')
print(f"Price of {products[idx]}: {prices[idx]}")
# No vectorized math operations built in
# total = sum(prices)  # Works, but loops in Python - slow for large data
# 🔰 NOVICE: Creating a simple Series
import pandas as pd
prices = pd.Series([1.20, 0.50, 3.00])
# Default integer index (0, 1, 2...)
print(prices)
# 📈 INTERMEDIATE: Series with Index
import pandas as pd
# Mapping labels to data
prices = pd.Series(
[1.20, 0.50, 3.00],
index=['Apple', 'Banana', 'Cherry']
)
print(prices)
print(prices['Apple']) # Access by label!
# ⭐ BEST PRACTICE: Named Series & Vectorization
import pandas as pd
prices = pd.Series(
[1.20, 0.50, 3.00],
index=['Apple', 'Banana', 'Cherry'],
name='Fruit Prices'
)
# Vectorized operation: Apply discount to ALL items at once
# No loop needed! Much faster than lists.
discounted = prices * 0.9
print("Discounted Prices:")
print(discounted)
Never loop over a Series if you can avoid it! Operations like series * 0.9 happen in C at lightning speed.
Create a Series representing the population of 3 cities (NY, LA, Chicago) in millions. Then, calculate the population if it grows by 10%.
Open notebook-sessions/week4/session1_data_engineering.ipynb and build a Series with labels. Practice vectorized operations (discounts, taxes) and label-based access with .loc.
Forgetting alignment: When you add two Series with different indices, pandas aligns them automatically—missing indices become NaN!
Confusing .loc vs []: series[0] is ambiguous — with integer labels it selects by label, otherwise by position (and positional [] access is deprecated in modern pandas). Use series.iloc[0] for position, series.loc['label'] for labels.
Not checking dtype: A Series with mixed types becomes object dtype—very slow! Always use series.dtype to check.
Looping over Series: Never use for item in series: for calculations. Use vectorized operations instead!
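The alignment gotcha above is easy to see with two toy Series (labels here are made up):

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

total = a + b            # aligned by LABEL, not position
print(total)             # x: NaN, y: 12.0, z: NaN

# If NaN isn't what you want, fill missing labels before adding:
filled = a.add(b, fill_value=0)
print(filled)            # x: 1.0, y: 12.0, z: 20.0
```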
1. When would you use a Series instead of a Python list? What advantages does it offer?
2. What happens when you perform arithmetic on two Series with different indices? Try it!
3. Why is labeled indexing more powerful than positional indexing? Give a real-world example.
4. How does a Series relate to a DataFrame? (Hint: Each DataFrame column IS a Series!)
📋 DataFrames: 2D Tables
The DataFrame is the heart of Pandas. It's a table where columns can have different types.
# ❌ BAD: List of lists (Matrix style)
data = [
['Alice', 25, 'Engineer'],
['Bob', 30, 'Designer']
]
# Must remember index 0 is name, 1 is age...
print(data[0][0]) # Alice
# Error-prone and hard to read
# 🔰 NOVICE: Dictionary to DataFrame
import pandas as pd
data = {
'name': ['Alice', 'Bob'],
'age': [25, 30],
'job': ['Engineer', 'Designer']
}
df = pd.DataFrame(data)
print(df)
# 📈 INTERMEDIATE: Specify Index
import pandas as pd
data = {
'name': ['Alice', 'Bob'],
'age': [25, 30],
'job': ['Engineer', 'Designer']
}
# Use ID as index for fast lookups
df = pd.DataFrame(data, index=[101, 102])
df.index.name = 'id'
print(df)
# ⭐ BEST PRACTICE: Robust creation and inspection
import pandas as pd
# List of dictionaries (row-oriented) is often cleaner for API data
rows = [
{'id': 101, 'name': 'Alice', 'age': 25, 'job': 'Engineer'},
{'id': 102, 'name': 'Bob', 'age': 30, 'job': 'Designer'}
]
df = pd.DataFrame(rows).set_index('id')  # Chaining avoids inplace=True pitfalls
print(df)
print("\nInfo:")
df.info() # Always verify types!
Modifying copies: df[df['age'] > 25]['name'] = 'X' might not work! Use df.loc[df['age'] > 25, 'name'] = 'X' instead.
Chained indexing: df['col'][0] = value causes SettingWithCopyWarning. Use df.loc[0, 'col'] = value.
Not using inplace carefully: inplace=True modifies the original DataFrame and returns None. Can cause bugs if you expect a return value!
Assuming order survives: pandas preserves row order, but operations like merges and groupby can reorder rows. Sort explicitly, or rely on the index rather than position.
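The safe assignment pattern from the first two gotchas, shown on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Wrong: df[df['age'] > 25]['name'] = 'X'
#   -> chained indexing may modify a temporary copy and warn

# Right: one .loc call selecting the rows AND the target column together
df.loc[df['age'] > 25, 'name'] = 'X'
print(df)
#     name  age
# 0  Alice   25
# 1      X   30
```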
Create a DataFrame of at least 3 students with columns: name, grade, and favorite_subject. Set student IDs as the index.
df.set_index('student_id', inplace=True) or create with pd.DataFrame(students).set_index('student_id').
Open notebook-sessions/week4/session1_data_engineering.ipynb and create DataFrames from dicts and rows (list of dicts). Set a meaningful index (e.g., id) and verify with df.info().
1. Why is using set_index() better than relying on the default 0, 1, 2... indices? When would you use each?
2. What's the difference between creating a DataFrame from a dict of lists vs a list of dicts? Which is more intuitive for row-oriented data (like API responses)?
3. When should you call df.info()? Why is checking dtypes important before starting analysis?
4. How would you explain the relationship between Series and DataFrames to someone new to Pandas?
Open notebook-sessions/week4/session2_data_engineering_group.ipynb and load sample CSV/JSON files. Use parse_dates, dtype, and usecols. Add error handling and print shape.
📂 Loading Data (ETL)
Extracting data from files is the first step in any pipeline.
# ❌ BAD: Manual file reading
with open('data.csv', 'r') as f:
lines = f.readlines()
# Nightmare to handle:
# - Quoted strings with commas ("New York, NY")
# - Missing values
# - Type conversion (everything is a string)
print(lines[:2])
# 🔰 NOVICE: Basic read_csv
import pandas as pd
# Reads the file, but defaults usually aren't enough
df = pd.read_csv('data.csv')
# Problem: Dates are read as strings (object)
print(df.head(1))
# 📈 INTERMEDIATE: Parsing dates and index
import pandas as pd
df = pd.read_csv(
'data.csv',
parse_dates=['date'], # Convert to datetime objects
index_col='date' # Use date as index
)
print(df.head(1))
# ⭐ BEST PRACTICE: Robust configuration
import pandas as pd
# Define types explicitly to save memory and prevent errors
dtypes = {
'category': 'category', # Efficient for repeated strings
'id': 'str', # IDs should be strings, not numbers
'value': 'float32' # Use smaller float if precision allows
}
try:
df = pd.read_csv(
'data.csv',
dtype=dtypes,
parse_dates=['timestamp'],
na_values=['n/a', 'missing', '-'], # Handle custom nulls
usecols=['id', 'timestamp', 'category', 'value'] # Load only what you need
)
print(f"Data loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
print("Error: File not found.")
Not handling errors: Always use try/except when loading files. Files can be missing, corrupted, or have wrong formats.
Wrong encoding: CSV files can have different encodings (utf-8, latin-1, cp1252). Use encoding='utf-8' or try different encodings if you see errors.
Not checking shape: Always check df.shape after loading. If you expect 1000 rows but get 10, something went wrong!
Ignoring parse dates: Dates load as strings by default. Use parse_dates=['date_column'] to auto-convert them.
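One way to handle the encoding gotcha — a sketch that writes a tiny latin-1 file for the demo and falls back through a list of encodings (the filename and helper name are made up):

```python
import pandas as pd

# Create a small CSV with a latin-1 encoded character, just for this demo
with open('demo_latin1.csv', 'w', encoding='latin-1') as f:
    f.write('city,value\nMálaga,1\n')

def read_csv_flexible(path, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Try each encoding until one decodes the file."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")

df = read_csv_flexible('demo_latin1.csv')  # utf-8 fails, latin-1 succeeds
print(df)
```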
1. Why is error handling critical when loading external data? What could go wrong?
2. When would you use usecols to load only specific columns? How does this save memory?
3. What's the difference between read_csv() and read_excel()? Which is faster and why?
4. How would you handle a 10GB CSV file that doesn't fit in memory? (Hint: chunksize parameter!)
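The chunksize idea from question 4, sketched on a tiny stand-in file (in real use, the CSV would be the one too big for memory):

```python
import pandas as pd

# Tiny stand-in file; imagine it were 10 GB
pd.DataFrame({'value': range(10)}).to_csv('big_demo.csv', index=False)

# Stream the file in chunks and aggregate incrementally --
# only one chunk is ever in memory at a time
total = 0
for chunk in pd.read_csv('big_demo.csv', chunksize=4):
    total += chunk['value'].sum()

print(total)  # 45 = sum of 0..9
```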
Open notebook-sessions/week4/session2_data_engineering_group.ipynb and perform the inspection trilogy on loaded data — df.shape, df.info(), df.isna().sum() — plus df.describe() for a bonus statistical summary.
🔍 Data Inspection
Trust but verify. Always inspect data immediately after loading.
# ❌ BAD: Printing the whole dataframe
print(df)
# Problems:
# - Floods the console/notebook
# - Slow for large datasets
# - Doesn't show data types or nulls
# 🔰 NOVICE: Just checking the top
print(df.head())
# Better, but misses hidden issues like:
# - Wrong data types (numbers stored as strings)
# - Null values further down
# ⭐ BEST PRACTICE: The Inspection Trilogy
# 1. Shape (Dimensions)
print(f"SHAPE: {df.shape}")
# 2. Types (df.info() also adds memory usage)
print("\nTYPES:")
print(df.dtypes)
# 3. Null Audit (Crucial!)
print("\nMISSING:")
print(df.isna().sum())
# Bonus: Statistics (Distribution)
# print(df.describe())
You've mastered the basics of Data Engineering with Pandas!