👋 Welcome to Week 4: Data Engineering with Pandas

This week marks your transition from general Python to specialized data engineering. You'll learn Pandas—the most powerful tool for working with structured data in Python.

🎯 What You'll Learn Today
  • Pandas Architecture — Understand Series (1D) and DataFrame (2D) data structures
  • Data Creation — Build DataFrames from dictionaries, lists, and CSV files
  • ETL Patterns — Extract, Transform, Load data using Pythonic best practices
  • Data Inspection — Use head(), info(), describe() to understand your data
  • Code Quality — Write clean, efficient, professional data engineering code

📚 Key Terms to Know

Series
A one-dimensional labeled array. Like a column in a spreadsheet or a Python list with labels.
DataFrame
A two-dimensional labeled table. Like a spreadsheet or SQL table in Python.
Index
The row labels. Can be numbers (0, 1, 2...) or custom labels (dates, IDs, names).
ETL
Extract, Transform, Load—the process of getting raw data, cleaning it, and preparing it for analysis.
dtype
Data type of a column (int64, float64, object, bool, datetime64). Critical for memory and performance.
Vectorization
Applying operations to entire columns at once instead of looping. Much faster than row-by-row iteration.
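A quick sketch of vectorization versus a loop, using invented prices:

```python
import pandas as pd

prices = pd.Series([1.20, 0.50, 3.00])

# Row-by-row loop: Python touches one element at a time
looped = [p * 1.1 for p in prices]

# Vectorized: one expression, executed in optimized C code
vectorized = prices * 1.1

print(vectorized.round(2).tolist())  # same numbers as the loop, far faster at scale
```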

🗺️ Today's Topics

Click any card below to jump directly to that section:

🐼
Pandas Setup & Imports
Learn the standard way to import Pandas and verify your environment.
📊
Series: 1D Labeled Data
Master the Series object—labeled arrays that power DataFrame columns.
📋
DataFrames: 2D Tables
Create and manipulate DataFrames—the heart of Pandas data analysis.
📁
Loading External Data
Read CSV, Excel, JSON files into DataFrames with proper error handling.
🔍
Data Inspection & Exploration
Use head(), info(), describe() to understand datasets before analysis.
💡 Think of Pandas Like Excel... But Supercharged

Series = Single Column: One column of data with a label for each row.

DataFrame = Entire Spreadsheet: Multiple columns of different types, all aligned by row.

Index = Row Numbers: But you can use dates, IDs, or any labels instead of just 0, 1, 2...

The Difference? Pandas handles millions of rows, does calculations instantly, and lets you write code to automate everything you'd click in Excel!

🌍 Why Pandas Matters in the Real World

Pandas is the foundation of data science in Python. Nearly every data role you'll encounter uses it:

Data Analysis — Clean, filter, aggregate, and visualize datasets

Machine Learning — Prepare data for scikit-learn, TensorFlow, PyTorch

Business Intelligence — Generate reports, dashboards, and insights

ETL Pipelines — Extract data from APIs/databases, transform it, load to warehouses

Financial Analysis — Time series, stock data, portfolio optimization

📖 How to Use This Lesson

Each topic shows the same problem solved 4 different ways:

🔴 Bad Practice — Code that works but has problems (inefficient, error-prone, unprofessional)

🟠 Novice Solution — Basic working code that beginners write

🔵 Intermediate Solution — Better code with improvements

🟢 Best Practice — Professional, production-ready code

Focus on understanding why each level improves, not just memorizing syntax!

📚 How to Approach This Lesson

1. Think in columns, not rows: Pandas is optimized for column operations. Avoid looping through rows whenever possible.

2. Check types early: Use df.info() and df.dtypes immediately. Wrong types cause bugs and slow performance.

3. Start small: Test operations on df.head() before running on the full dataset.

4. Read error messages: Pandas errors are detailed. They often tell you exactly what's wrong.

5. Practice with real data: Download a CSV from Kaggle or data.gov and explore it using these techniques.
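Points 1 through 3 above in one small sketch (toy data, invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'price': [1.2, 0.5, 3.0], 'qty': [10, 4, 2]})

# 1. Think in columns: one expression per column, no row loop
df['total'] = df['price'] * df['qty']

# 2. Check types early
print(df.dtypes)

# 3. Start small: preview before running on the full dataset
print(df.head())
```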

🐼 Pandas Basics & Setup

Your data engineering journey begins with the right setup. How you import and initialize Pandas matters.

Python ❌ Bad Practice Star Import
# ❌ BAD: Wildcard import or no alias
# Problem 1: Dumps every public pandas name (well over a hundred!) into your namespace
# Problem 2: Can shadow Python built-ins, e.g. pandas.eval replaces eval()
# Problem 3: Makes code impossible to debug - where did DataFrame come from?
from pandas import *

# Now these are in your namespace, but you can't tell where they're from:
df = DataFrame({'a': [1, 2]})  # ⚠️ Is this pandas.DataFrame? Hard to tell!

# Even worse - pandas.eval has silently replaced Python's built-in eval()!
# eval('2 > 1')  # Now runs pandas.eval, which supports only a subset of Python
⚠️ Why This Is Bad

Namespace pollution: Pandas adds well over a hundred names to your global scope, making name collisions likely.

Shadows built-ins: Pandas exports names such as eval() that silently replace Python's built-in versions.

Unreadable: Other developers (and future you) can't tell where DataFrame came from without checking imports.

Tooling breaks: IDEs and linters can't provide accurate autocomplete or type checking.

Python 🔰 Novice Full Name Import
# 🔰 NOVICE: Importing without alias
# ✓ No namespace pollution
# ✓ Clear where DataFrame comes from
# ✗ Verbose - have to type 'pandas.' every time
import pandas

# Now it's explicit - DataFrame comes from pandas
df = pandas.DataFrame({'a': [1, 2]})
print(df)

# But typing 'pandas.' gets tedious in real projects...
series = pandas.Series([1, 2, 3])  # Too much typing!
✅ What's Better

Safe: No namespace pollution or name conflicts.

Explicit: pandas.DataFrame clearly shows where the class comes from.

But: Writing pandas. hundreds of times in a data analysis script is tedious and hurts readability.

Python 📈 Intermediate Standard Alias
# 📈 INTERMEDIATE: Standard 'pd' alias
# ✓ The universal convention in data science
# ✓ Every data scientist uses 'pd' - instantly recognizable
# ✓ Short enough to not be tedious
import pandas as pd

# Clean, concise, and standard
df = pd.DataFrame({'a': [1, 2]})
print(df)

# 'pd' is so standard, you'll see it in:
# - Every tutorial
# - Every Kaggle notebook
# - Every data science job
✅ What's Better

Universal standard: pd is recognized by every data scientist worldwide.

Concise: Short enough to not clutter code, but explicit enough to be clear.

Tool support: IDEs and linters are optimized for import pandas as pd.

Python ⭐ Best Practice Full Setup
# ⭐ BEST PRACTICE: Standard imports + version verification
# ✓ Standard aliases everyone recognizes
# ✓ Import both pandas and numpy (pandas is built on numpy)
# ✓ Verify versions for reproducibility
import pandas as pd
import numpy as np

# CRITICAL in notebooks: Print versions for reproducibility
# Different pandas versions can behave differently!
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")

# Optional: Set display options for better readability
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.precision', 2)      # 2 decimal places

df = pd.DataFrame({'a': [1, 2]})
print(df)
⭐ Why This Is Best

Version checking: Pandas 2.x has breaking changes from 1.x. Always verify versions in production code.

Numpy included: Pandas is built on NumPy. You'll often need both, so import them together.

Display options: pd.set_option() makes DataFrames more readable in Jupyter notebooks.

Reproducibility: Anyone running your code knows exactly what versions you used.

⚠️ Common Pandas Import Pitfalls

Using non-standard aliases: Don't use import pandas as p or import pandas as panda. Always use pd.

Forgetting numpy: Most Pandas operations need NumPy. Import both at the start.

Not checking versions: Pandas 1.x code might break in 2.x. Always verify versions in production.

Importing specific functions: Don't do from pandas import DataFrame. Always use the pd. prefix for clarity.

💬 Group Discussion Questions

1. Why is using a standard alias like pd important for collaboration? How would non-standard aliases hurt a team project?

2. When would you need to check the pandas version? Have you ever encountered code that broke due to version differences?

3. What other libraries have universally-recognized aliases? (Hint: numpy → np, matplotlib.pyplot → plt...)

4. Why does from pandas import * break things, but import pandas as pd doesn't?

📓 Practice in Notebook

Open notebook-sessions/week4/session1_data_engineering.ipynb and implement standard imports (pd, np), version checks, and display options. Add a cell explaining why version verification matters.

🏋️ Exercise: DataFrame Explorer

Goal: Load a CSV file and explore its contents using Pandas inspection methods.

  • Import pandas with the standard alias pd
  • Load a CSV file using pd.read_csv('your_file.csv')
  • Display the first 10 rows using df.head(10)
  • Show all column data types with df.dtypes
  • Generate summary statistics with df.describe()
  • Print the DataFrame shape: print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

💡 Tip: Try it with the sample data in the week2/week2_data/ folder!

🚀 Challenge: Environment Verification Script

Create a reusable setup verification function:

  • Write a function verify_environment() that checks Pandas and NumPy versions
  • Print a warning if Pandas version is below 2.0
  • Set optimal display options (max_columns, precision, max_rows)
  • Return a dict with environment info: {'pandas': pd.__version__, 'numpy': np.__version__}
  • Bonus: Add a check for optional dependencies like matplotlib or openpyxl

⭐ Pro tip: This function can be your standard "first cell" in every data analysis notebook!

📊 Pandas Series: 1D Labeled Data

A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a column in a spreadsheet or a Python list with labels instead of just positions.

📖 Series Vocabulary

Series — A 1D array with labeled indices (like a single column)

index — The row labels (can be numbers, strings, dates, etc.)

values — The actual data as a NumPy array

dtype — The data type of the Series (int64, float64, object, etc.)

name — Optional label for the Series (becomes column name in DataFrame)

💡 Series vs Python Lists

Python List: [1.2, 0.5, 3.0] — Access by position: prices[0]

Pandas Series: {'Apple': 1.2, 'Banana': 0.5, 'Cherry': 3.0} — Access by label: prices['Apple']

Why use Series? Labeled indices, automatic alignment, vectorized operations, and missing data handling!

Python ❌ Bad Practice Using Parallel Lists
# ❌ BAD: Using parallel lists for labeled data
products = ['Apple', 'Banana', 'Cherry']
prices = [1.20, 0.50, 3.00]

# Hard to maintain connection between product and price
idx = products.index('Apple')
print(f"Price of {products[idx]}: {prices[idx]}")

# No built-in math operations
# total = sum(prices) # Works, but slow for large data
Python 🔰 Novice Basic Series
# 🔰 NOVICE: Creating a simple Series
import pandas as pd

prices = pd.Series([1.20, 0.50, 3.00])

# Default integer index (0, 1, 2...)
print(prices)
Python 📈 Intermediate Indexed Series
# 📈 INTERMEDIATE: Series with Index
import pandas as pd

# Mapping labels to data
prices = pd.Series(
    [1.20, 0.50, 3.00],
    index=['Apple', 'Banana', 'Cherry']
)

print(prices)
print(prices['Apple'])  # Access by label!
Python ⭐ Best Practice Vectorized Operations
# ⭐ BEST PRACTICE: Named Series & Vectorization
import pandas as pd

prices = pd.Series(
    [1.20, 0.50, 3.00],
    index=['Apple', 'Banana', 'Cherry'],
    name='Fruit Prices'
)

# Vectorized operation: Apply discount to ALL items at once
# No loop needed! Much faster than lists.
discounted = prices * 0.9

print("Discounted Prices:")
print(discounted)
⭐ Pro Tip: Vectorization

Never loop over a Series if you can avoid it! Operations like series * 0.9 happen in C at lightning speed.

🎮 Try It: Create a Series
Challenge

Create a Series representing the population of 3 cities (NY, LA, Chicago) in millions. Then, calculate the population if it grows by 10%.

📓 Practice in Notebook

Open notebook-sessions/week4/session1_data_engineering.ipynb and build a Series with labels. Practice vectorized operations (discounts, taxes) and label-based access with .loc.

🏋️ Exercise: Data Filtering

Goal: Practice filtering Series data based on conditions.

  • Create a Series of 10 stock prices with company ticker symbols as the index
  • Filter stocks priced above $100: expensive = prices[prices > 100]
  • Find stocks in a specific range: mid_range = prices[(prices >= 50) & (prices <= 150)]
  • Use .loc[] to select specific tickers: prices.loc[['AAPL', 'GOOGL', 'MSFT']]
  • Calculate the percentage of stocks above the median price

💡 Tip: Boolean indexing is one of the most powerful Pandas features—master it!
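A sketch of the filtering steps above, with made-up ticker prices:

```python
import pandas as pd

# Hypothetical prices - not real market data
prices = pd.Series(
    [189.0, 412.0, 11.5, 97.0],
    index=['AAPL', 'MSFT', 'F', 'INTC'],
)

expensive = prices[prices > 100]                      # boolean mask keeps matching rows
mid_range = prices[(prices >= 50) & (prices <= 150)]  # note the parentheses around each condition

print(expensive.index.tolist())  # labels survive the filter
print(mid_range.index.tolist())
```

The parentheses matter: `&` binds tighter than `>=`, so leaving them out raises an error.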

🚀 Challenge: Price Comparison Analysis

Build a complete price analysis using two Series:

  • Create last_week and this_week Series with matching product indices
  • Calculate price changes: change = this_week - last_week
  • Calculate percentage change: pct_change = (change / last_week) * 100
  • Find products that increased in price (filter where change > 0)
  • Find the product with the largest percentage increase
  • Create a summary report showing: avg change, max increase, max decrease

⭐ This mirrors real-world retail analytics and financial reporting!

⚠️ Common Series Pitfalls

Forgetting alignment: When you add two Series with different indices, pandas aligns them automatically—missing indices become NaN!

Confusing .loc vs []: series[0] is ambiguous. With a default integer index it selects by label, and positional fallback on label indexes is deprecated in pandas 2.x. Use series.iloc[0] for position and series.loc['label'] for labels.

Not checking dtype: A Series with mixed types becomes object dtype—very slow! Always use series.dtype to check.

Looping over Series: Never use for item in series: for calculations. Use vectorized operations instead!
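The alignment pitfall above is easy to see in two lines:

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

# Pandas aligns on labels: 'y' exists in both Series, while 'x' and 'z'
# exist in only one, so those positions become NaN in the sum.
result = a + b
print(result)
```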

💬 Group Discussion Questions

1. When would you use a Series instead of a Python list? What advantages does it offer?

2. What happens when you perform arithmetic on two Series with different indices? Try it!

3. Why is labeled indexing more powerful than positional indexing? Give a real-world example.

4. How does a Series relate to a DataFrame? (Hint: Each DataFrame column IS a Series!)
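Question 4's hint can be checked directly in code:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

column = df['age']            # selecting one column yields a Series
print(type(column).__name__)  # Series
print(column.name)            # the column label becomes the Series name
```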

📋 DataFrames: 2D Tables

The DataFrame is the heart of Pandas. It's a table where columns can have different types.

Python ❌ Bad Practice List of Lists
# ❌ BAD: List of lists (Matrix style)
data = [
    ['Alice', 25, 'Engineer'],
    ['Bob', 30, 'Designer']
]

# Must remember index 0 is name, 1 is age...
print(data[0][0])  # Alice
# Error-prone and hard to read
Python 🔰 Novice Dict to DataFrame
# 🔰 NOVICE: Dictionary to DataFrame
import pandas as pd

data = {
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'job': ['Engineer', 'Designer']
}

df = pd.DataFrame(data)
print(df)
Python 📈 Intermediate Setting Index
# 📈 INTERMEDIATE: Specify Index
import pandas as pd

data = {
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'job': ['Engineer', 'Designer']
}

# Use ID as index for fast lookups
df = pd.DataFrame(data, index=[101, 102])
df.index.name = 'id'

print(df)
Python ⭐ Best Practice Robust Creation
# ⭐ BEST PRACTICE: Robust creation and inspection
import pandas as pd

# List of dictionaries (row-oriented) is often cleaner for API data
rows = [
    {'id': 101, 'name': 'Alice', 'age': 25, 'job': 'Engineer'},
    {'id': 102, 'name': 'Bob', 'age': 30, 'job': 'Designer'}
]

df = pd.DataFrame(rows)
df = df.set_index('id')  # Assignment is safer than inplace=True

print(df)
print("\nInfo:")
df.info() # Always verify types!
⚠️ Common DataFrame Pitfalls

Modifying copies: df[df['age'] > 25]['name'] = 'X' might not work! Use df.loc[df['age'] > 25, 'name'] = 'X' instead.

Chained indexing: df['col'][0] = value causes SettingWithCopyWarning. Use df.loc[0, 'col'] = value.

Not using inplace carefully: inplace=True modifies the original DataFrame and returns None. Can cause bugs if you expect a return value!

Assuming order: DataFrames preserve insertion order, but operations like merge, groupby, and concat can reorder rows. Sort explicitly or rely on the index when order matters.
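The chained-indexing fix described above, as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# One .loc call selects rows and column together, so the assignment
# modifies the original DataFrame rather than a temporary copy.
df.loc[df['age'] > 25, 'name'] = 'Promoted'

print(df)
```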

🎮 Try It: Build Your Own DataFrame
Challenge

Create a DataFrame of at least 3 students with columns: name, grade, and favorite_subject. Set student IDs as the index.

Hint: Use df.set_index('student_id', inplace=True) or create with pd.DataFrame(students).set_index('student_id').
📓 Practice in Notebook

Open notebook-sessions/week4/session1_data_engineering.ipynb and create DataFrames from dicts and rows (list of dicts). Set a meaningful index (e.g., id) and verify with df.info().

💬 Group Discussion Questions

1. Why is using set_index() better than relying on the default 0, 1, 2... indices? When would you use each?

2. What's the difference between creating a DataFrame from a dict of lists vs a list of dicts? Which is more intuitive for row-oriented data (like API responses)?

3. When should you call df.info()? Why is checking dtypes important before starting analysis?

4. How would you explain the relationship between Series and DataFrames to someone new to Pandas?

📓 Practice in Notebook

Open notebook-sessions/week4/session2_data_engineering_group.ipynb and load sample CSV/JSON files. Use parse_dates, dtype, and usecols. Add error handling and print shape.

🏋️ Exercise: Column Operations

Goal: Practice adding, modifying, and managing DataFrame columns.

  • Create a DataFrame with columns: product, price, quantity
  • Add a calculated column: df['total'] = df['price'] * df['quantity']
  • Add a tax column (8.5% tax): df['tax'] = df['total'] * 0.085
  • Rename columns using: df.rename(columns={'price': 'unit_price'}, inplace=True)
  • Add a column with default value: df['currency'] = 'USD'
  • Handle missing values: df['discount'] = df['discount'].fillna(0)

💡 Tip: Always verify changes with df.head() after each operation!

🚀 Challenge: Sales Analysis Dashboard

Build a complete sales analysis from raw data:

  • Create a sales DataFrame with: salesperson, region, product, units_sold, unit_price
  • Calculate total revenue per row: df['revenue'] = df['units_sold'] * df['unit_price']
  • Find total revenue across all sales: total = df['revenue'].sum()
  • Calculate average revenue per sale: avg = df['revenue'].mean()
  • Find the top salesperson: df.groupby('salesperson')['revenue'].sum().idxmax()
  • Find the best-performing region
  • Identify the top 3 highest-revenue transactions
  • Calculate what percentage each region contributes to total revenue

⭐ This is exactly what business analysts do daily—you're building real skills!
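A minimal groupby sketch with invented numbers, showing the pattern the challenge relies on:

```python
import pandas as pd

# Invented sample data
sales = pd.DataFrame({
    'salesperson': ['Ana', 'Ben', 'Ana'],
    'revenue': [200.0, 150.0, 300.0],
})

# groupby splits rows by salesperson, sum() aggregates each group,
# and idxmax() returns the label with the largest total.
totals = sales.groupby('salesperson')['revenue'].sum()
print(totals.idxmax())  # Ana: 200 + 300 = 500 beats Ben's 150
```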

📂 Loading Data (ETL)

Extracting data from files is the first step in any pipeline.

Python ❌ Bad Practice Manual Parsing
# ❌ BAD: Manual file reading
with open('data.csv', 'r') as f:
    lines = f.readlines()
    
# Nightmare to handle:
# - Quoted strings with commas ("New York, NY")
# - Missing values
# - Type conversion (everything is a string)
print(lines[:2])
Python 🔰 Novice Basic read_csv
# 🔰 NOVICE: Basic read_csv
import pandas as pd

# Reads the file, but defaults usually aren't enough
df = pd.read_csv('data.csv')

# Problem: Dates are read as strings (object)
print(df.head(1))
Python 📈 Intermediate Parse Dates
# 📈 INTERMEDIATE: Parsing dates and index
import pandas as pd

df = pd.read_csv(
    'data.csv',
    parse_dates=['date'],  # Convert to datetime objects
    index_col='date'       # Use date as index
)

print(df.head(1))
Python ⭐ Best Practice Robust ETL Loading
# ⭐ BEST PRACTICE: Robust configuration
import pandas as pd

# Define types explicitly to save memory and prevent errors
dtypes = {
    'category': 'category',  # Efficient for repeated strings
    'id': 'str',             # IDs should be strings, not numbers
    'value': 'float32'       # Use smaller float if precision allows
}

try:
    df = pd.read_csv(
        'data.csv',
        dtype=dtypes,
        parse_dates=['timestamp'],
        na_values=['n/a', 'missing', '-'],  # Handle custom nulls
        usecols=['id', 'timestamp', 'category', 'value'] # Load only what you need
    )
    print(f"Data loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
    print("Error: File not found.")
⚠️ Common Data Loading Pitfalls

Not handling errors: Always use try/except when loading files. Files can be missing, corrupted, or have wrong formats.

Wrong encoding: CSV files can have different encodings (utf-8, latin-1, cp1252). Use encoding='utf-8' or try different encodings if you see errors.

Not checking shape: Always check df.shape after loading. If you expect 1000 rows but get 10, something went wrong!

Ignoring parse dates: Dates load as strings by default. Use parse_dates=['date_column'] to auto-convert them.

💬 Group Discussion Questions

1. Why is error handling critical when loading external data? What could go wrong?

2. When would you use usecols to load only specific columns? How does this save memory?

3. What's the difference between read_csv() and read_excel()? Which is faster and why?

4. How would you handle a 10GB CSV file that doesn't fit in memory? (Hint: chunksize parameter!)
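Question 4's chunksize idea, sketched with an in-memory buffer standing in for a huge file:

```python
import io
import pandas as pd

# StringIO stands in for a file too big to load at once;
# with a real file you would pass its path to read_csv instead.
csv_data = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):  # only 4 rows in memory per iteration
    total += chunk['value'].sum()  # aggregate each chunk, keep a running total

print(total)  # 2 * (0 + 1 + ... + 9) = 90
```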

🏋️ Exercise: Robust Data Loading

Goal: Practice loading data with proper error handling and type specification.

  • Create a CSV file with columns: id, name, date, value, category
  • Include some missing values (empty cells) and custom null markers ('N/A', '-')
  • Load the file with explicit dtype specification for each column
  • Use parse_dates=['date'] to convert date strings automatically
  • Handle custom null values with na_values=['N/A', '-', 'missing']
  • Wrap loading in try/except to handle FileNotFoundError
  • Print df.info() and df.isna().sum() to verify loading worked correctly

💡 Tip: Real-world data ALWAYS has issues—practice defensive loading!

🚀 Challenge: Multi-Format ETL Pipeline

Build a complete ETL function that handles multiple file formats:

  • Create a function load_data(filepath) that auto-detects file type by extension
  • Support .csv, .json, .xlsx formats
  • Use Path(filepath).suffix to detect file extension
  • Apply appropriate reader: pd.read_csv(), pd.read_json(), pd.read_excel()
  • Return both the DataFrame and a metadata dict: {'rows': n, 'cols': m, 'source': filepath}
  • Handle unknown file types with a helpful error message
  • Bonus: Add a validate=True parameter that runs automatic data quality checks

⭐ This pattern is the foundation of production ETL pipelines!

📓 Practice in Notebook

Open notebook-sessions/week4/session2_data_engineering_group.ipynb and perform the inspection trilogy on loaded data: df.shape, df.info(), df.isna().sum(), and df.describe().

🔍 Data Inspection

Trust but verify. Always inspect data immediately after loading.

Python ❌ Bad Practice Print Everything
# ❌ BAD: Printing the whole dataframe
print(df)

# Problems:
# - Floods the console/notebook
# - Slow for large datasets
# - Doesn't show data types or nulls
Python 🔰 Novice Using head()
# 🔰 NOVICE: Just checking the top
print(df.head())

# Better, but misses hidden issues like:
# - Wrong data types (numbers stored as strings)
# - Null values further down
Python ⭐ Best Practice Comprehensive Audit
# ⭐ BEST PRACTICE: The Inspection Trilogy
# 1. Shape (Dimensions)
print(f"SHAPE: {df.shape}")

# 2. Types (run df.info() for memory usage too)
print("\nTYPES:")
print(df.dtypes)

# 3. Null Audit (Crucial!)
print("\nMISSING:")
print(df.isna().sum())

# 4. Statistics (Distribution)
# print(df.describe())

📜 Data Engineering Cheat Sheet

Quick reference for your daily ETL tasks with Pandas.

🧠 Quick Quiz: Week 4 Concepts

1) Which parameter converts a date column to datetime in read_csv?

2) Select by label vs by position:

3) Which inspection method is MOST informative for column types and null counts?

Pandas Patterns Reference
# ═══ IMPORTS ═══
import pandas as pd
import numpy as np

# ═══ CREATION ═══
df = pd.DataFrame(data)              # From dict/list
s = pd.Series(data, name='MyVar')  # 1D Data

# ═══ LOADING ═══
df = pd.read_csv('file.csv', parse_dates=['date'])
df = pd.read_json('file.json')
df = pd.read_sql(query, connection)

# ═══ INSPECTION ═══
df.head(5)         # First 5 rows
df.shape           # (rows, cols)
df.info()          # Types & memory
df.describe()      # Stats
df.isna().sum()    # Count nulls

# ═══ SELECTION ═══
df['col']          # Select column
df[['c1', 'c2']]   # Select multiple
df.loc['id_5']     # Select row by Label
df.iloc[0]         # Select row by Integer Position
🎓 Key Takeaways from Week 4

Pandas Imports: Always use import pandas as pd and import numpy as np. Check versions for reproducibility.

Series vs DataFrame: Series = 1D labeled array (like a column). DataFrame = 2D table (like a spreadsheet). Each DataFrame column IS a Series.

Loading Data: Use pd.read_csv(), pd.read_excel(), pd.read_json(). Always handle errors with try/except.

Inspection First: Always run df.info(), df.describe(), df.isna().sum() before analysis. Check dtypes immediately!

Think Vectorized: Avoid loops. Use column operations. Pandas is optimized for vectorized operations on entire columns.