🤖 What is Machine Learning?

🎯 Learning Objectives
  • Understand Supervised Learning (labeled data) vs. Unsupervised Learning
  • Master the Scikit-Learn API (`fit`, `predict`)
  • Split data into Training and Testing sets
  • Build Regression and Classification models
  • Evaluate model performance (MSE, Accuracy)
📚 Key Vocabulary
Supervised Learning: Training a model with labeled data (input-output pairs). The model learns patterns to predict outputs for new inputs.
Features (X): Input variables used to make predictions. Columns in your DataFrame (e.g., square footage, age, temperature).
Target (y): The output variable you're trying to predict. The label or dependent variable (e.g., house price, disease diagnosis).
Training Set: Data used to train the model (typically 70-80%). The model learns patterns from this data.
Test Set: Data held back for evaluating model performance (typically 20-30%). Never seen by the model during training.
Overfitting: When a model learns training data too well, including noise, and performs poorly on new data. Like memorizing vs. understanding.
🎯 Analogy: Machine Learning as Learning Math

Supervised Learning: A teacher gives you math problems with answers (labeled data). You study them to learn patterns, then solve new problems on your own.

Features (X): The numbers in the problem (e.g., "If apples cost $2 each..."). Input variables.

Target (y): The answer you're solving for (e.g., "Total cost = $10"). Output variable.

Training Set: Practice problems you use to learn. The more practice, the better you get.

Test Set: The final exam. Problems you've never seen, testing if you truly learned vs. memorized.

Overfitting: Memorizing answers to practice problems instead of learning the underlying math. You ace practice but fail the exam.

💡 The Concept

Traditional Programming: Input + Rules = Output

Machine Learning: Input + Output = Rules

💡 Learning Strategy: Intro to ML

Start Simple: Master Linear Regression before deep learning. Understand the basics deeply before adding complexity.

Always Split Your Data: Never evaluate on training data. Use train_test_split() religiously.

Visualize Everything: Plot predictions vs. actual values. Seeing errors visually helps you understand model behavior.

Learn the Scikit-Learn API: All models follow the fit(X_train, y_train) → predict(X_test) pattern. Learn it once, apply everywhere.

Focus on Data Quality: Garbage in, garbage out. Clean data beats fancy algorithms every time.

🏗️ The Scikit-Learn Workflow

Almost every model in Scikit-Learn follows the same three-step pattern: create the model, fit it to training data, and predict on new data.

Python ❌ Bad Practice: Testing on Training Data
# ❌ BAD: Training and testing on the same data
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)

# Predicting on data the model has already memorized!
# This tells us nothing about how it handles new data.
preds = model.predict(X)
Python 🔰 Novice: Train/Test Split
# 🔰 NOVICE: Splitting data
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100)

# Split data: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
Python ⭐ Best Practice: Stratified Split
# ⭐ BEST PRACTICE: Stratify to keep class balance
from sklearn.model_selection import train_test_split

# stratify=y ensures that if y has 90% class A and 10% class B,
# the train and test sets will preserve this ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
🏋️ Exercise: Iris Classification

Build your first classifier using the famous Iris dataset:

  • Load the Iris dataset using from sklearn.datasets import load_iris
  • Explore the data: print the feature names and target names
  • Split the data into training (80%) and testing (20%) sets
  • Train a KNeighborsClassifier with n_neighbors=3
  • Make predictions on the test set and calculate accuracy
  • Try different values of n_neighbors (1, 5, 7) and compare accuracy
Starter Code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# TODO: Split the data
# TODO: Create and train the model
# TODO: Make predictions and evaluate
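One possible solution sketch for the steps above (the stratified split and the loop over neighbor counts are my own choices, not the only valid ones):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data and inspect the names
iris = load_iris()
X, y = iris.data, iris.target
print(f"Features: {iris.feature_names}")
print(f"Classes:  {list(iris.target_names)}")

# 80/20 split, stratified so every species appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare several neighbor counts on the held-out test set
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"k={k}: accuracy={acc:.3f}")
```

On a dataset this small and clean, all values of k score highly; the point of the comparison is the habit of tuning on held-out data, not the winner.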
🏋️ Exercise: Feature Scaling

Practice data preprocessing with StandardScaler:

  • Create a synthetic dataset with features on different scales (e.g., age: 0-100, income: 0-1000000)
  • Apply StandardScaler to normalize the features
  • Compare the mean and standard deviation before and after scaling
  • Important: Fit the scaler on training data only, then transform both train and test
  • Train a LogisticRegression model with and without scaling—compare accuracy
Starter Code
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Create data with different scales
np.random.seed(42)
age = np.random.randint(18, 80, 100)  # Range: 18-80
income = np.random.randint(20000, 200000, 100)  # Range: 20k-200k
X = np.column_stack([age, income])
y = (income > 100000).astype(int)  # Simple target

# TODO: Split the data
# TODO: Create scaler and fit on X_train only
# TODO: Transform both X_train and X_test
# TODO: Print mean/std before and after scaling
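A sketch of the scaling steps, using the same synthetic data as the starter code. The key line is that `fit_transform` runs on the training set only, while the test set gets plain `transform`:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Same synthetic data as the starter code
np.random.seed(42)
age = np.random.randint(18, 80, 100)
income = np.random.randint(20000, 200000, 100)
X = np.column_stack([age, income]).astype(float)
y = (income > 100000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training data ONLY; the test set stays unseen
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Before:", X_train.mean(axis=0), X_train.std(axis=0))
print("After: ", X_train_scaled.mean(axis=0).round(6),
      X_train_scaled.std(axis=0).round(6))
```

After scaling, each training-set column has mean ≈ 0 and standard deviation ≈ 1; the test-set columns come close but not exactly, because they were standardized with the training set's statistics.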

📈 Linear Regression

Predicting a continuous number (like house price).

Python ⭐ Linear Model
from sklearn.linear_model import LinearRegression
import numpy as np

# Data: y = 2x
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

model = LinearRegression()
model.fit(X, y)

# Predict for x = 5
pred = model.predict([[5]])

print(f"Prediction for 5.0: {pred[0]}")
print(f"Coefficient: {model.coef_[0]}")
🏋️ Exercise: House Price Predictor

Build a linear regression model to predict house prices:

  • Load the California Housing dataset: from sklearn.datasets import fetch_california_housing
  • Explore the features (MedInc, HouseAge, AveRooms, etc.)
  • Split data into train/test sets (80/20 split)
  • Train a LinearRegression model
  • Calculate Mean Squared Error (MSE) and R² score on test data
  • Bonus: Try Ridge regression and compare performance
Starter Code
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print(f"Features: {housing.feature_names}")
print(f"X shape: {X.shape}")

# TODO: Split the data
# TODO: Train LinearRegression
# TODO: Make predictions and calculate MSE, R²
# TODO: Try Ridge regression
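The evaluation pattern above transfers to any regression dataset. Here is a sketch on synthetic data from `make_regression` (so it runs without the California Housing download), using exactly the MSE/R² calls you would apply to the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for a regression dataset
X, y = make_regression(n_samples=500, n_features=8, n_informative=8,
                       noise=10.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same fit/predict/score loop for both plain and regularized regression
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    print(f"{type(model).__name__}: MSE={mse:.1f}, R²={r2:.3f}")
```

On the real housing data the R² values will be much lower (roughly 0.6 for plain linear regression); the synthetic data here is deliberately easy.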
🚀 Challenge: Non-Linear Relationships

Extend your regression skills with polynomial features:

  • Create synthetic data with a non-linear relationship: y = x² + noise
  • Try fitting a LinearRegression—observe the poor fit
  • Use PolynomialFeatures(degree=2) to transform your data
  • Fit LinearRegression on the polynomial features
  • Compare R² scores: linear vs polynomial
  • Experiment with degree=3, 4, 5—what happens with too high a degree?
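A sketch of the first few challenge steps: fit a straight line to parabolic data, then add degree-2 polynomial features and fit again. The data generation details (range, noise level) are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# y = x² + noise
rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)

# Plain linear fit: a straight line cannot follow a parabola
linear = LinearRegression().fit(X, y)
r2_linear = r2_score(y, linear.predict(X))

# Degree-2 features add an x² column, so the same "linear" model
# can now fit the curve
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
r2_poly = r2_score(y, poly.predict(X_poly))

print(f"Linear R²:     {r2_linear:.3f}")
print(f"Polynomial R²: {r2_poly:.3f}")
```

As the degree grows past what the data needs, training R² keeps creeping up while the fit starts chasing noise: overfitting in miniature.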

🏷️ Classification

Predicting a category (Spam vs Not Spam).

Python ⭐ Evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Dummy data
X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_train = [0, 1, 1, 0]  # XOR problem

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Test (toy demo: these points also appear in the training data,
# so treat this as an API illustration, not a real evaluation)
X_test = [[0, 0], [1, 1]]
y_test = [0, 1]

preds = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, preds)}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, preds))
🏋️ Exercise: Metrics Comparison

Understand when accuracy isn't enough—compare different metrics:

  • Create an imbalanced dataset: 90% class 0, 10% class 1
  • Train a classifier and calculate Accuracy, Precision, Recall, and F1-score
  • Create a "dummy" classifier that always predicts class 0—what's its accuracy?
  • Compare all metrics between your model and the dummy classifier
  • Understand: When is accuracy misleading? When do we care about Recall vs Precision?
  • Use classification_report() for a complete metrics summary
Starter Code
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, 
                             recall_score, f1_score, classification_report)

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, 
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TODO: Train a classifier
# TODO: Calculate accuracy, precision, recall, F1
# TODO: Compare with a dummy "always predict 0" baseline
# TODO: Print classification_report()
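A sketch of the core comparison, using `DummyClassifier` from scikit-learn as the always-predict-majority baseline. The numbers illustrate why accuracy alone misleads on imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

for name, clf in [("LogisticRegression", model), ("Dummy", dummy)]:
    preds = clf.predict(X_test)
    acc = accuracy_score(y_test, preds)
    rec = recall_score(y_test, preds, zero_division=0)
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}")
```

The dummy scores about 90% accuracy while catching zero positive cases (recall = 0): a model that never finds the minority class can still look "accurate" on an imbalanced problem.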
🚀 Challenge: Customer Churn Predictor

Build a complete classification pipeline to predict customer churn:

  • Generate synthetic customer data with features: tenure, monthly_charges, total_charges, contract_length
  • Create a target variable: churn (1) or stay (0) based on logical rules
  • Preprocess: Handle any missing values, scale numerical features
  • Split data with stratification (important for imbalanced churn data!)
  • Train multiple classifiers: LogisticRegression, RandomForest, KNeighbors
  • Compare all three using accuracy, precision, recall, and F1-score
  • Identify which model is best for minimizing customer loss (hint: focus on recall for churners)
  • Visualize the confusion matrix for your best model
Starter Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic churn data
np.random.seed(42)
n_customers = 1000

data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, n_customers),  # Months
    'monthly_charges': np.random.uniform(20, 100, n_customers),
    'contract_length': np.random.choice([1, 12, 24], n_customers),
})
data['total_charges'] = data['tenure'] * data['monthly_charges']

# Create churn logic: high charges + short tenure + month-to-month = likely churn
churn_prob = (100 - data['tenure']) / 100 + (data['monthly_charges'] / 200)
churn_prob += (data['contract_length'] == 1) * 0.3
data['churn'] = (churn_prob > np.random.uniform(size=n_customers) + 0.5).astype(int)

# TODO: Prepare features (X) and target (y)
# TODO: Split with stratify=y
# TODO: Scale features
# TODO: Train and compare 3 different classifiers
# TODO: Print classification reports and identify best model
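A compact solution sketch for the model-comparison steps, reusing the starter code's synthetic data. Wrapping each classifier in a scaling pipeline is my choice here; it keeps the preprocessing identical across models:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# Same synthetic churn data as the starter code
np.random.seed(42)
n = 1000
data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(20, 100, n),
    'contract_length': np.random.choice([1, 12, 24], n),
})
data['total_charges'] = data['tenure'] * data['monthly_charges']
churn_prob = (100 - data['tenure']) / 100 + data['monthly_charges'] / 200
churn_prob += (data['contract_length'] == 1) * 0.3
data['churn'] = (churn_prob > np.random.uniform(size=n) + 0.5).astype(int)

X = data.drop(columns='churn')
y = data['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Scaling lives inside each pipeline, so every model sees scaled input
models = {
    'LogisticRegression': LogisticRegression(),
    'RandomForest': RandomForestClassifier(random_state=42),
    'KNeighbors': KNeighborsClassifier(),
}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    rec = recall_score(y_test, pipe.predict(X_test))
    print(f"{name}: churn recall={rec:.2f}")
```

Since the business goal is not to lose customers, recall on the churn class is the metric to rank by: a missed churner costs more than a false alarm.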

🚀 Pipelines

Chain preprocessing and modeling into a single object. Essential for production.

Python ⭐ Scalable ML
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline that:
# 1. Scales the data (StandardScaler)
# 2. Trains a Logistic Regression model
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# Now we can treat 'pipe' just like a model
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

print(f"Pipeline Score: {score}")
🏋️ Exercise: Building a Complete Pipeline

Create a production-ready pipeline that handles everything:

  • Load the breast cancer dataset: from sklearn.datasets import load_breast_cancer
  • Build a pipeline with: StandardScaler → PCA(n_components=10) → LogisticRegression
  • Use Pipeline with named steps instead of make_pipeline
  • Train and evaluate the pipeline
  • Access individual steps using pipe.named_steps['scaler']
  • Compare performance with and without PCA dimensionality reduction
Starter Code
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# TODO: Split the data
# TODO: Create Pipeline with named steps
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# TODO: Fit and evaluate
# TODO: Access named steps and inspect components
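A sketch completing the TODOs above; `max_iter=1000` is my addition to keep LogisticRegression from hitting its iteration limit:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# The whole pipeline behaves like one model
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"Accuracy with PCA: {score:.3f}")

# Inspect an individual step by its name
pca = pipe.named_steps['pca']
print(f"Variance kept by 10 components: {pca.explained_variance_ratio_.sum():.3f}")
```

Dropping from 30 features to 10 components usually costs little accuracy here, because the first few principal components capture most of the variance.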
🚀 Challenge: Cross-Validation with Pipelines

Combine pipelines with cross-validation for robust model evaluation:

  • Use cross_val_score with your pipeline for 5-fold cross-validation
  • Calculate mean and standard deviation of scores across folds
  • Use GridSearchCV to tune hyperparameters within the pipeline
  • Search over: pca__n_components: [5, 10, 15] and classifier__C: [0.1, 1, 10]
  • Print the best parameters and best score
  • Understand why preprocessing must be inside the pipeline for proper cross-validation
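The steps above can be sketched as follows (again on the breast cancer data, which ships with scikit-learn). Note the `step__param` naming convention for tuning inside a pipeline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# 5-fold CV: scaler and PCA are refit inside each fold, so the test
# fold never leaks into preprocessing
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Tune pipeline steps via '<step_name>__<param>'
param_grid = {
    'pca__n_components': [5, 10, 15],
    'classifier__C': [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(f"Best params:   {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

If the scaler were fit on the full dataset before cross-validation, statistics from each test fold would leak into training: putting preprocessing inside the pipeline is what makes the fold scores honest.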

📜 ML Cheat Sheet

Patterns Reference
# ═══ SETUP ═══
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# ═══ MODELS ═══
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

model = LinearRegression()
model.fit(X_train, y_train)    # Train
preds = model.predict(X_test)  # Predict

# ═══ METRICS ═══
from sklearn.metrics import mean_squared_error, accuracy_score
acc = accuracy_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
🎯 Key Takeaways: Intro to Machine Learning
  • Always Split Your Data: Use train_test_split() before training. Never evaluate on training data—it causes overoptimistic metrics.
  • Scikit-Learn API is Consistent: All models follow model.fit(X_train, y_train) → model.predict(X_test). Learn it once, apply everywhere.
  • Regression vs. Classification: Regression predicts continuous values (price, temperature). Classification predicts categories (spam/not spam, disease/healthy).
  • Evaluate with Metrics: Use MSE for regression, Accuracy/Precision/Recall for classification. Don't trust training accuracy—always test on holdout data.
  • Beware Overfitting: Complex models memorize training data but fail on new data. Simpler models often generalize better.
  • Features Matter More Than Algorithms: Good feature engineering (selecting, transforming variables) beats fancy algorithms. Garbage in, garbage out.
  • Use Pipelines for Production: Pipeline bundles preprocessing and model, preventing data leakage and ensuring reproducibility.
  • Visualize Predictions: Plot predicted vs. actual values. Visual errors reveal patterns metrics miss (e.g., systematic bias).