👋 Welcome to Week 5: Data Wrangling & Aggregation
This week you'll dive deep into Pandas data manipulation. Building on your Week 4 foundation, you'll learn to select specific data subsets, aggregate data by groups, merge multiple datasets, and transform columns for analysis.
- Data Selection — Master .loc[], .iloc[], boolean filtering, and column selection
- Aggregation — Use groupby(), agg(), sum, mean, count to summarize data
- Merging — Combine DataFrames with merge() and different join types
- Transformation — Create and modify columns using .map() and operators
- Pivot Tables — Reorganize data for better analysis and presentation
📚 Key Terms to Know
🗺️ Today's Topics
Click any card below to jump directly to that section:
Series = Single Column: One column of data with a label for each row.
DataFrame = Entire Spreadsheet: Multiple columns of different types, all aligned by row.
Index = Row Labels: Defaults to 0, 1, 2..., but you can use dates, IDs, or any labels instead.
The Difference? Pandas handles millions of rows, does calculations instantly, and lets you write code to automate everything you'd click in Excel!
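The three terms above map directly onto code — a tiny sketch (the names `scores` and `df` are just for illustration):

```python
import pandas as pd

# A Series is one labeled column
scores = pd.Series([95, 88], index=['alice', 'bob'], name='score')

# A DataFrame is a table of such columns, all aligned by row
df = pd.DataFrame(
    {'score': [95, 88], 'grade': ['A', 'B']},
    index=['alice', 'bob'],   # the index: labels instead of just 0, 1, 2...
)

print(df.loc['alice', 'score'])  # 95 -- look up a cell by its labels
```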
Pandas is the foundation of data science in Python. Every data job you'll encounter uses it:
• Data Analysis — Clean, filter, aggregate, and visualize datasets
• Machine Learning — Prepare data for scikit-learn, TensorFlow, PyTorch
• Business Intelligence — Generate reports, dashboards, and insights
• ETL Pipelines — Extract data from APIs/databases, transform it, load to warehouses
• Financial Analysis — Time series, stock data, portfolio optimization
Each topic shows the same problem solved 4 different ways:
🔴 Bad Practice — Code that works but has problems (inefficient, error-prone, unprofessional)
🟠 Novice Solution — Basic working code that beginners write
🔵 Intermediate Solution — Better code with improvements
🟢 Best Practice — Professional, production-ready code
Focus on understanding why each level improves, not just memorizing syntax!
1. Think in columns, not rows: Pandas is optimized for column operations. Avoid looping through rows whenever possible.
2. Check types early: Use df.info() and df.dtypes immediately. Wrong types cause bugs and slow performance.
3. Start small: Test operations on df.head() before running on the full dataset.
4. Read error messages: Pandas errors are detailed. They often tell you exactly what's wrong.
5. Practice with real data: Download a CSV from Kaggle or data.gov and explore it using these techniques.
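Tip 1 in practice — the same calculation done row-by-row and column-wise, on a made-up column of prices:

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])

# Row-by-row loop (avoid): verbose, and slow on large data
taxed_loop = pd.Series([p * 1.08 for p in prices])

# Column-wise (prefer): one vectorized expression
taxed_vec = prices * 1.08

print(taxed_vec.equals(taxed_loop))  # True -- same result, far faster at scale
```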
🐼 Pandas Basics & Setup
Your data engineering journey begins with the right setup. How you import and initialize Pandas matters.
# ❌ BAD: Wildcard import or no alias
# Problem 1: Dumps hundreds of names from pandas into your namespace
# Problem 2: Silently shadows any matching name already in scope
# Problem 3: Makes code impossible to debug - where did DataFrame come from?
from pandas import *
# Now these are in your namespace, but you can't tell where they're from:
df = DataFrame({'a': [1, 2]}) # ⚠️ Is this pandas.DataFrame? Hard to tell!
# Worse, collisions are silent: an imported name that matches one of
# yours simply replaces it (NumPy's wildcard import even shadows sum()!)
Namespace pollution: the wildcard dumps hundreds of pandas names into your global scope, making name collisions likely.
Shadows existing names: wildcard imports silently replace matching names in scope — NumPy's wildcard import even overrides Python built-ins like sum(), any(), and all().
Unreadable: Other developers (and future you) can't tell where DataFrame came from without checking imports.
Tooling breaks: IDEs and linters can't provide accurate autocomplete or type checking.
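A small demo of the shadowing problem (requires NumPy, which pandas is built on — NumPy's wildcard export is the classic example of overriding built-ins):

```python
import builtins

from numpy import *  # noqa: F401,F403 -- the risky wildcard import

# 'sum' no longer refers to Python's built-in sum():
print(sum is builtins.sum)   # False -- it's now numpy's sum
print(sum([1, 2, 3]))        # 6 here, but the two functions behave differently
```

Same answer on a simple list, but numpy's `sum` has a different signature and error behavior — exactly the kind of silent swap that makes wildcard imports hard to debug.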
# 🔰 NOVICE: Importing without alias
# ✓ No namespace pollution
# ✓ Clear where DataFrame comes from
# ✗ Verbose - have to type 'pandas.' every time
import pandas
# Now it's explicit - DataFrame comes from pandas
df = pandas.DataFrame({'a': [1, 2]})
print(df)
# But typing 'pandas.' gets tedious in real projects...
series = pandas.Series([1, 2, 3]) # Too much typing!
Safe: No namespace pollution or name conflicts.
Explicit: pandas.DataFrame clearly shows where the class comes from.
But: Writing pandas. hundreds of times in a data analysis script is tedious and hurts readability.
# 📈 INTERMEDIATE: Standard 'pd' alias
# ✓ The universal convention in data science
# ✓ Every data scientist uses 'pd' - instantly recognizable
# ✓ Short enough to not be tedious
import pandas as pd
# Clean, concise, and standard
df = pd.DataFrame({'a': [1, 2]})
print(df)
# 'pd' is so standard, you'll see it in:
# - Every tutorial
# - Every Kaggle notebook
# - Every data science job
Universal standard: pd is recognized by every data scientist worldwide.
Concise: Short enough to not clutter code, but explicit enough to be clear.
Tool support: IDEs and linters are optimized for import pandas as pd.
# ⭐ BEST PRACTICE: Standard imports + version verification
# ✓ Standard aliases everyone recognizes
# ✓ Import both pandas and numpy (pandas is built on numpy)
# ✓ Verify versions for reproducibility
import pandas as pd
import numpy as np
# CRITICAL in notebooks: Print versions for reproducibility
# Different pandas versions can behave differently!
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")
# Optional: Set display options for better readability
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.precision', 2) # 2 decimal places
df = pd.DataFrame({'a': [1, 2]})
print(df)
Version checking: Pandas 2.x has breaking changes from 1.x. Always verify versions in production code.
Numpy included: Pandas is built on NumPy. You'll often need both, so import them together.
Display options: pd.set_option() makes DataFrames more readable in Jupyter notebooks.
Reproducibility: Anyone running your code knows exactly what versions you used.
Using non-standard aliases: Don't use import pandas as p or import pandas as panda. Always use pd.
Forgetting numpy: Most Pandas operations need NumPy. Import both at the start.
Not checking versions: Pandas 1.x code might break in 2.x. Always verify versions in production.
Importing specific functions: Don't do from pandas import DataFrame. Always use the pd. prefix for clarity.
1. Why is using a standard alias like pd important for collaboration? How would non-standard aliases hurt a team project?
2. When would you need to check the pandas version? Have you ever encountered code that broke due to version differences?
3. What other libraries have universally-recognized aliases? (Hint: numpy → np, matplotlib.pyplot → plt...)
4. Why does from pandas import * break things, but import pandas as pd doesn't?
Open notebook-sessions/week4/session1_data_engineering.ipynb and implement standard imports (pd, np), version checks, and display options. Add a cell explaining why version verification matters.
📊 Pandas Series: 1D Labeled Data
A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a column in a spreadsheet or a Python list with labels instead of just positions.
Series — A 1D array with labeled indices (like a single column)
index — The row labels (can be numbers, strings, dates, etc.)
values — The actual data as a NumPy array
dtype — The data type of the Series (int64, float64, object, etc.)
name — Optional label for the Series (becomes column name in DataFrame)
Python List: [1.2, 0.5, 3.0] — Access by position: prices[0]
Pandas Series: {'Apple': 1.2, 'Banana': 0.5, 'Cherry': 3.0} — Access by label: prices['Apple']
Why use Series? Labeled indices, automatic alignment, vectorized operations, and missing data handling!
# ❌ BAD: Using parallel lists for labeled data
products = ['Apple', 'Banana', 'Cherry']
prices = [1.20, 0.50, 3.00]
# Hard to maintain connection between product and price
idx = products.index('Apple')
print(f"Price of {products[idx]}: {prices[idx]}")
# No vectorized math operations built in
# total = sum(prices)  # Works, but loops in Python - slow for large data
# 🔰 NOVICE: Creating a simple Series
import pandas as pd
prices = pd.Series([1.20, 0.50, 3.00])
# Default integer index (0, 1, 2...)
print(prices)
# 📈 INTERMEDIATE: Series with Index
import pandas as pd
# Mapping labels to data
prices = pd.Series(
[1.20, 0.50, 3.00],
index=['Apple', 'Banana', 'Cherry']
)
print(prices)
print(prices['Apple']) # Access by label!
# ⭐ BEST PRACTICE: Named Series & Vectorization
import pandas as pd
prices = pd.Series(
[1.20, 0.50, 3.00],
index=['Apple', 'Banana', 'Cherry'],
name='Fruit Prices'
)
# Vectorized operation: Apply discount to ALL items at once
# No loop needed! Much faster than lists.
discounted = prices * 0.9
print("Discounted Prices:")
print(discounted)
Never loop over a Series if you can avoid it! Operations like series * 0.9 happen in C at lightning speed.
Create a Series representing the population of 3 cities (NY, LA, Chicago) in millions. Then, calculate the population if it grows by 10%.
Open notebook-sessions/week4/session1_data_engineering.ipynb and build a Series with labels. Practice vectorized operations (discounts, taxes) and label-based access with .loc.
Forgetting alignment: When you add two Series with different indices, pandas aligns them automatically—missing indices become NaN!
Confusing .loc vs []: series[0] is ambiguous — with integer labels it selects by label, otherwise by position (and positional [] access is deprecated in modern pandas). Use series.iloc[0] for position, series.loc['label'] for labels.
Not checking dtype: A Series with mixed types becomes object dtype—very slow! Always use series.dtype to check.
Looping over Series: Never use for item in series: for calculations. Use vectorized operations instead!
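The alignment gotcha above is easy to see with two toy Series (labels here are made up):

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

total = a + b            # aligned by LABEL, not position
print(total)             # x: NaN, y: 12.0, z: NaN

# If NaN isn't what you want, fill missing labels before adding:
filled = a.add(b, fill_value=0)
print(filled)            # x: 1.0, y: 12.0, z: 20.0
```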
1. When would you use a Series instead of a Python list? What advantages does it offer?
2. What happens when you perform arithmetic on two Series with different indices? Try it!
3. Why is labeled indexing more powerful than positional indexing? Give a real-world example.
4. How does a Series relate to a DataFrame? (Hint: Each DataFrame column IS a Series!)
📋 DataFrames: 2D Tables
The DataFrame is the heart of Pandas. It's a table where columns can have different types.
# ❌ BAD: List of lists (Matrix style)
data = [
['Alice', 25, 'Engineer'],
['Bob', 30, 'Designer']
]
# Must remember index 0 is name, 1 is age...
print(data[0][0]) # Alice
# Error-prone and hard to read
# 🔰 NOVICE: Dictionary to DataFrame
import pandas as pd
data = {
'name': ['Alice', 'Bob'],
'age': [25, 30],
'job': ['Engineer', 'Designer']
}
df = pd.DataFrame(data)
print(df)
# 📈 INTERMEDIATE: Specify Index
import pandas as pd
data = {
'name': ['Alice', 'Bob'],
'age': [25, 30],
'job': ['Engineer', 'Designer']
}
# Use ID as index for fast lookups
df = pd.DataFrame(data, index=[101, 102])
df.index.name = 'id'
print(df)
# ⭐ BEST PRACTICE: Robust creation and inspection
import pandas as pd
# List of dictionaries (row-oriented) is often cleaner for API data
rows = [
{'id': 101, 'name': 'Alice', 'age': 25, 'job': 'Engineer'},
{'id': 102, 'name': 'Bob', 'age': 30, 'job': 'Designer'}
]
df = pd.DataFrame(rows).set_index('id')  # Chaining avoids inplace=True pitfalls
print(df)
print("\nInfo:")
df.info() # Always verify types!
Modifying copies: df[df['age'] > 25]['name'] = 'X' might not work! Use df.loc[df['age'] > 25, 'name'] = 'X' instead.
Chained indexing: df['col'][0] = value causes SettingWithCopyWarning. Use df.loc[0, 'col'] = value.
Not using inplace carefully: inplace=True modifies the original DataFrame and returns None. Can cause bugs if you expect a return value!
Assuming order survives: pandas preserves row order, but operations like merges and groupby can reorder rows. Sort explicitly, or rely on the index rather than position.
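The safe assignment pattern from the first two gotchas, shown on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Wrong: df[df['age'] > 25]['name'] = 'X'
#   -> chained indexing may modify a temporary copy and warn

# Right: one .loc call selecting the rows AND the target column together
df.loc[df['age'] > 25, 'name'] = 'X'
print(df)
#     name  age
# 0  Alice   25
# 1      X   30
```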
Create a DataFrame of at least 3 students with columns: name, grade, and favorite_subject. Set student IDs as the index.
df.set_index('student_id', inplace=True) or create with pd.DataFrame(students).set_index('student_id').
Open notebook-sessions/week4/session1_data_engineering.ipynb and create DataFrames from dicts and rows (list of dicts). Set a meaningful index (e.g., id) and verify with df.info().
1. Why is using set_index() better than relying on the default 0, 1, 2... indices? When would you use each?
2. What's the difference between creating a DataFrame from a dict of lists vs a list of dicts? Which is more intuitive for row-oriented data (like API responses)?
3. When should you call df.info()? Why is checking dtypes important before starting analysis?
4. How would you explain the relationship between Series and DataFrames to someone new to Pandas?
Open notebook-sessions/week4/session2_data_engineering_group.ipynb and load sample CSV/JSON files. Use parse_dates, dtype, and usecols. Add error handling and print shape.
📂 Loading Data (ETL)
Extracting data from files is the first step in any pipeline.
# ❌ BAD: Manual file reading
with open('data.csv', 'r') as f:
lines = f.readlines()
# Nightmare to handle:
# - Quoted strings with commas ("New York, NY")
# - Missing values
# - Type conversion (everything is a string)
print(lines[:2])
# 🔰 NOVICE: Basic read_csv
import pandas as pd
# Reads the file, but defaults usually aren't enough
df = pd.read_csv('data.csv')
# Problem: Dates are read as strings (object)
print(df.head(1))
# 📈 INTERMEDIATE: Parsing dates and index
import pandas as pd
df = pd.read_csv(
'data.csv',
parse_dates=['date'], # Convert to datetime objects
index_col='date' # Use date as index
)
print(df.head(1))
# ⭐ BEST PRACTICE: Robust configuration
import pandas as pd
# Define types explicitly to save memory and prevent errors
dtypes = {
'category': 'category', # Efficient for repeated strings
'id': 'str', # IDs should be strings, not numbers
'value': 'float32' # Use smaller float if precision allows
}
try:
df = pd.read_csv(
'data.csv',
dtype=dtypes,
parse_dates=['timestamp'],
na_values=['n/a', 'missing', '-'], # Handle custom nulls
usecols=['id', 'timestamp', 'category', 'value'] # Load only what you need
)
print(f"Data loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
print("Error: File not found.")
Not handling errors: Always use try/except when loading files. Files can be missing, corrupted, or have wrong formats.
Wrong encoding: CSV files can have different encodings (utf-8, latin-1, cp1252). Use encoding='utf-8' or try different encodings if you see errors.
Not checking shape: Always check df.shape after loading. If you expect 1000 rows but get 10, something went wrong!
Ignoring parse dates: Dates load as strings by default. Use parse_dates=['date_column'] to auto-convert them.
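One way to handle the encoding gotcha — a sketch that writes a tiny latin-1 file for the demo and falls back through a list of encodings (the filename and helper name are made up):

```python
import pandas as pd

# Create a small CSV with a latin-1 encoded character, just for this demo
with open('demo_latin1.csv', 'w', encoding='latin-1') as f:
    f.write('city,value\nMálaga,1\n')

def read_csv_flexible(path, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Try each encoding until one decodes the file."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")

df = read_csv_flexible('demo_latin1.csv')  # utf-8 fails, latin-1 succeeds
print(df)
```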
1. Why is error handling critical when loading external data? What could go wrong?
2. When would you use usecols to load only specific columns? How does this save memory?
3. What's the difference between read_csv() and read_excel()? Which is faster and why?
4. How would you handle a 10GB CSV file that doesn't fit in memory? (Hint: chunksize parameter!)
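The chunksize idea from question 4, sketched on a tiny stand-in file (in real use, the CSV would be the one too big for memory):

```python
import pandas as pd

# Tiny stand-in file; imagine it were 10 GB
pd.DataFrame({'value': range(10)}).to_csv('big_demo.csv', index=False)

# Stream the file in chunks and aggregate incrementally --
# only one chunk is ever in memory at a time
total = 0
for chunk in pd.read_csv('big_demo.csv', chunksize=4):
    total += chunk['value'].sum()

print(total)  # 45 = sum of 0..9
```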
Open notebook-sessions/week4/session2_data_engineering_group.ipynb and perform the inspection trilogy on loaded data — df.shape, df.info(), df.isna().sum() — plus df.describe() for a bonus statistical summary.
🔍 Data Inspection
Trust but verify. Always inspect data immediately after loading.
# ❌ BAD: Printing the whole dataframe
print(df)
# Problems:
# - Floods the console/notebook
# - Slow for large datasets
# - Doesn't show data types or nulls
# 🔰 NOVICE: Just checking the top
print(df.head())
# Better, but misses hidden issues like:
# - Wrong data types (numbers stored as strings)
# - Null values further down
# ⭐ BEST PRACTICE: The Inspection Trilogy
# 1. Shape (Dimensions)
print(f"SHAPE: {df.shape}")
# 2. Types (df.info() also adds memory usage)
print("\nTYPES:")
print(df.dtypes)
# 3. Null Audit (Crucial!)
print("\nMISSING:")
print(df.isna().sum())
# Bonus: Statistics (Distribution)
# print(df.describe())
You've mastered the basics of Data Engineering with Pandas!