🤖 What is Machine Learning?
- Understand Supervised Learning (Labels) vs Unsupervised Learning
- Master the Scikit-Learn API (`fit`, `predict`)
- Split data into Training and Testing sets
- Build Regression and Classification models
- Evaluate model performance (MSE, Accuracy)
Supervised Learning: A teacher gives you math problems with answers (labeled data). You study them to learn patterns, then solve new problems on your own.
Features (X): The numbers in the problem (e.g., "If apples cost $2 each..."). Input variables.
Target (y): The answer you're solving for (e.g., "Total cost = $10"). Output variable.
Training Set: Practice problems you use to learn. The more practice, the better you get.
Test Set: The final exam. Problems you've never seen, testing if you truly learned vs. memorized.
Overfitting: Memorizing answers to practice problems instead of learning the underlying math. You ace practice but fail the exam.
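This gap shows up directly in code. A minimal sketch on synthetic data (names are illustrative): a 1-nearest-neighbor classifier effectively memorizes its training set, so it scores perfectly there while dropping on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data: flip_y=0.2 randomly flips 20% of labels
X, y = make_classification(n_samples=200, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# 1-NN "memorizes": each training point is its own nearest neighbor,
# so training accuracy is perfect while test accuracy is not.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```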
Traditional Programming: Input + Rules = Output
Machine Learning: Input + Output = Rules
Start Simple: Master Linear Regression before deep learning. Understand the basics deeply before adding complexity.
Always Split Your Data: Never evaluate on training data. Use train_test_split() religiously.
Visualize Everything: Plot predictions vs. actual values. Seeing errors visually helps you understand model behavior.
Learn Scikit-Learn API: All models follow fit(X_train, y_train) → predict(X_test) pattern. Learn it once, apply everywhere.
Focus on Data Quality: Garbage in, garbage out. Clean data beats fancy algorithms every time.
🏗️ The Scikit-Learn Workflow
Almost every model in Scikit-Learn follows the same 3-step pattern: create the model, `fit` on training data, `predict` on new data.
# ❌ BAD: Training and testing on the same data
from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
# Predicting on data the model has already memorized!
# This tells us nothing about how it handles new data.
preds = model.predict(X)
# 🔰 NOVICE: Splitting data
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100)
# Split data: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
# ⭐ BEST PRACTICE: Stratify to keep class balance
from sklearn.model_selection import train_test_split
# stratify=y ensures that if y has 90% class A and 10% class B,
# the train and test sets will preserve this ratio.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
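A quick sanity check (illustrative, using a synthetic imbalanced dataset) that `stratify=y` really does preserve the class ratio in both splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~90% class 0 / ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# y.mean() is the share of class 1; both splits keep roughly the same ratio
print(f"Overall minority share: {y.mean():.2f}")
print(f"Train minority share:   {y_tr.mean():.2f}")
print(f"Test minority share:    {y_te.mean():.2f}")
```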
Build your first classifier using the famous Iris dataset:
- Load the Iris dataset using `from sklearn.datasets import load_iris`
- Explore the data: print the feature names and target names
- Split the data into training (80%) and testing (20%) sets
- Train a `KNeighborsClassifier` with `n_neighbors=3`
- Make predictions on the test set and calculate accuracy
- Try different values of `n_neighbors` (1, 5, 7) and compare accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# TODO: Split the data
# TODO: Create and train the model
# TODO: Make predictions and evaluate
Practice data preprocessing with StandardScaler:
- Create a synthetic dataset with features on different scales (e.g., age: 0-100, income: 0-1000000)
- Apply `StandardScaler` to normalize the features
- Compare the mean and standard deviation before and after scaling
- Important: Fit the scaler on training data only, then transform both train and test
- Train a LogisticRegression model with and without scaling, and compare accuracy
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
# Create data with different scales
np.random.seed(42)
age = np.random.randint(18, 80, 100) # Range: 18-80
income = np.random.randint(20000, 200000, 100) # Range: 20k-200k
X = np.column_stack([age, income])
y = (income > 100000).astype(int) # Simple target
# TODO: Split the data
# TODO: Create scaler and fit on X_train only
# TODO: Transform both X_train and X_test
# TODO: Print mean/std before and after scaling
📈 Linear Regression
Predicting a continuous number (like house price).
from sklearn.linear_model import LinearRegression
import numpy as np
# Data: y = 2x
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
model = LinearRegression()
model.fit(X, y)
# Predict for x = 5
pred = model.predict([[5]])
print(f"Prediction for 5.0: {pred[0]}")
print(f"Coefficient: {model.coef_[0]}")
Build a linear regression model to predict house prices:
- Load the California Housing dataset: `from sklearn.datasets import fetch_california_housing`
- Explore the features (MedInc, HouseAge, AveRooms, etc.)
- Split data into train/test sets (80/20 split)
- Train a `LinearRegression` model
- Calculate Mean Squared Error (MSE) and R² score on test data
- Bonus: Try `Ridge` regression and compare performance
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(f"Features: {housing.feature_names}")
print(f"X shape: {X.shape}")
# TODO: Split the data
# TODO: Train LinearRegression
# TODO: Make predictions and calculate MSE, R²
# TODO: Try Ridge regression
Extend your regression skills with polynomial features:
- Create synthetic data with a non-linear relationship: `y = x² + noise`
- Try fitting a LinearRegression and observe the poor fit
- Use `PolynomialFeatures(degree=2)` to transform your data
- Fit LinearRegression on the polynomial features
- Compare R² scores: linear vs polynomial
- Experiment with degree=3, 4, 5: what happens with too high a degree?
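One possible sketch of the idea behind this exercise (variable names are assumptions, not a reference solution):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)  # y = x² + noise

# A plain line can't follow the parabola, so R² stays near zero
linear = LinearRegression().fit(X, y)
print(f"Linear R²:     {linear.score(X, y):.2f}")

# degree=2 adds an x² column, letting the "linear" model fit the curve
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print(f"Polynomial R²: {poly.score(X_poly, y):.2f}")
```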
🏷️ Classification
Predicting a category (Spam vs Not Spam).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Dummy data
X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_train = [0, 1, 1, 0] # XOR problem
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# "Test" on two of the training points (toy demo only;
# real evaluation needs held-out data, as covered above)
X_test = [[0, 0], [1, 1]]
y_test = [0, 1]
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds)}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, preds))
Understand when accuracy isn't enough by comparing different metrics:
- Create an imbalanced dataset: 90% class 0, 10% class 1
- Train a classifier and calculate Accuracy, Precision, Recall, and F1-score
- Create a "dummy" classifier that always predicts class 0: what's its accuracy?
- Compare all metrics between your model and the dummy classifier
- Understand: When is accuracy misleading? When do we care about Recall vs Precision?
- Use `classification_report()` for a complete metrics summary
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
recall_score, f1_score, classification_report)
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# TODO: Train a classifier
# TODO: Calculate accuracy, precision, recall, F1
# TODO: Compare with a dummy "always predict 0" baseline
# TODO: Print classification_report()
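If the dummy baseline is unclear, here is a sketch using scikit-learn's `DummyClassifier`, a convenience class built for exactly this kind of baseline (the synthetic dataset mirrors the starter code above):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# ~90/10 imbalanced dataset, stratified split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Always predicts the majority class (class 0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
preds = dummy.predict(X_te)

print(f"Accuracy: {accuracy_score(y_te, preds):.2f}")  # high, but misleading
print(f"Recall:   {recall_score(y_te, preds):.2f}")    # 0: misses every positive
```

High accuracy with zero recall is the classic sign that accuracy alone is misleading on imbalanced data.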
Build a complete classification pipeline to predict customer churn:
- Generate synthetic customer data with features: tenure, monthly_charges, total_charges, contract_length
- Create a target variable: churn (1) or stay (0) based on logical rules
- Preprocess: Handle any missing values, scale numerical features
- Split data with stratification (important for imbalanced churn data!)
- Train multiple classifiers: LogisticRegression, RandomForest, KNeighbors
- Compare all three using accuracy, precision, recall, and F1-score
- Identify which model is best for minimizing customer loss (hint: focus on recall for churners)
- Visualize the confusion matrix for your best model
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Generate synthetic churn data
np.random.seed(42)
n_customers = 1000
data = pd.DataFrame({
'tenure': np.random.randint(1, 72, n_customers), # Months
'monthly_charges': np.random.uniform(20, 100, n_customers),
'contract_length': np.random.choice([1, 12, 24], n_customers),
})
data['total_charges'] = data['tenure'] * data['monthly_charges']
# Create churn logic: high charges + short tenure + month-to-month = likely churn
churn_prob = (100 - data['tenure']) / 100 + (data['monthly_charges'] / 200)
churn_prob += (data['contract_length'] == 1) * 0.3
data['churn'] = (churn_prob > np.random.uniform(size=n_customers) + 0.5).astype(int)
# TODO: Prepare features (X) and target (y)
# TODO: Split with stratify=y
# TODO: Scale features
# TODO: Train and compare 3 different classifiers
# TODO: Print classification reports and identify best model
🚀 Pipelines
Chain preprocessing and modeling into a single object. Essential for production.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Create a pipeline that:
# 1. Scales the data (StandardScaler)
# 2. Trains a Logistic Regression model
pipe = make_pipeline(
StandardScaler(),
LogisticRegression()
)
# Now we can treat 'pipe' just like a model
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"Pipeline Score: {score}")
Create a production-ready pipeline that handles everything:
- Load the breast cancer dataset: `from sklearn.datasets import load_breast_cancer`
- Build a pipeline with: `StandardScaler` → `PCA(n_components=10)` → `LogisticRegression`
- Use `Pipeline` with named steps instead of `make_pipeline`
- Train and evaluate the pipeline
- Access individual steps using `pipe.named_steps['scaler']`
- Compare performance with and without PCA dimensionality reduction
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# TODO: Split the data
# TODO: Create Pipeline with named steps
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression())
])
# TODO: Fit and evaluate
# TODO: Access named steps and inspect components
Combine pipelines with cross-validation for robust model evaluation:
- Use `cross_val_score` with your pipeline for 5-fold cross-validation
- Calculate mean and standard deviation of scores across folds
- Use `GridSearchCV` to tune hyperparameters within the pipeline
- Search over: `pca__n_components: [5, 10, 15]` and `classifier__C: [0.1, 1, 10]`
- Print the best parameters and best score
- Understand why preprocessing must be inside the pipeline for proper cross-validation
- Understand why preprocessing must be inside the pipeline for proper cross-validation
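A sketch of these steps, assuming the breast cancer dataset and the step names (`pca`, `classifier`) from the previous exercise:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# 5-fold CV: scaler and PCA are re-fit inside each fold, so no
# information from the validation fold leaks into preprocessing
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Step names become parameter prefixes: '<step>__<param>'
grid = GridSearchCV(pipe, {
    'pca__n_components': [5, 10, 15],
    'classifier__C': [0.1, 1, 10],
}, cv=5)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score:  {grid.best_score_:.3f}")
```

This is why preprocessing belongs inside the pipeline: cross-validation on an externally scaled `X` would let test-fold statistics leak into training.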
📜 ML Cheat Sheet
# ═══ SETUP ═══
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# ═══ MODELS ═══
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model = LinearRegression()
model.fit(X_train, y_train) # Train
preds = model.predict(X_test) # Predict
# ═══ METRICS ═══
from sklearn.metrics import mean_squared_error, accuracy_score
acc = accuracy_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
- Always Split Your Data: Use `train_test_split()` before training. Never evaluate on training data; it yields overoptimistic metrics.
- Scikit-Learn API is Consistent: All models follow `model.fit(X_train, y_train)` → `model.predict(X_test)`. Learn it once, apply everywhere.
- Regression vs. Classification: Regression predicts continuous values (price, temperature). Classification predicts categories (spam/not spam, disease/healthy).
- Evaluate with Metrics: Use MSE for regression, Accuracy/Precision/Recall for classification. Don't trust training accuracy; always test on holdout data.
- Beware Overfitting: Complex models memorize training data but fail on new data. Simpler models often generalize better.
- Features Matter More Than Algorithms: Good feature engineering (selecting, transforming variables) beats fancy algorithms. Garbage in, garbage out.
- Use Pipelines for Production: `Pipeline` bundles preprocessing and model, preventing data leakage and ensuring reproducibility.
- Visualize Predictions: Plot predicted vs. actual values. Visual errors reveal patterns metrics miss (e.g., systematic bias).