🤖 What is Machine Learning?
- Understand Supervised Learning (Labels) vs Unsupervised Learning
- Master the Scikit-Learn API (`fit`, `predict`)
- Split data into Training and Testing sets
- Build Regression and Classification models
- Evaluate model performance (MSE, Accuracy)
Supervised Learning: A teacher gives you math problems with answers (labeled data). You study them to learn patterns, then solve new problems on your own.
Features (X): The numbers in the problem (e.g., "If apples cost $2 each..."). Input variables.
Target (y): The answer you're solving for (e.g., "Total cost = $10"). Output variable.
Training Set: Practice problems you use to learn. The more practice, the better you get.
Test Set: The final exam. Problems you've never seen, testing if you truly learned vs. memorized.
Overfitting: Memorizing answers to practice problems instead of learning the underlying math. You ace practice but fail the exam.
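This gap shows up directly in code. A minimal sketch on synthetic data (names are illustrative): a 1-nearest-neighbor classifier effectively memorizes its training set, so it scores perfectly there while dropping on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data: flip_y=0.2 randomly flips 20% of labels
X, y = make_classification(n_samples=200, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# 1-NN "memorizes": each training point is its own nearest neighbor,
# so training accuracy is perfect while test accuracy is not.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```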
Traditional Programming: Input + Rules = Output
Machine Learning: Input + Output = Rules
Start Simple: Master Linear Regression before deep learning. Understand the basics deeply before adding complexity.
Always Split Your Data: Never evaluate on training data. Use train_test_split() religiously.
Visualize Everything: Plot predictions vs. actual values. Seeing errors visually helps you understand model behavior.
Learn Scikit-Learn API: All models follow fit(X_train, y_train) → predict(X_test) pattern. Learn it once, apply everywhere.
Focus on Data Quality: Garbage in, garbage out. Clean data beats fancy algorithms every time.
🏗️ The Scikit-Learn Workflow
Almost every model in Scikit-Learn follows the same 3-step pattern: create the model, `fit` on training data, `predict` on new data.
# ❌ BAD: Training and testing on the same data
from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
# Predicting on data the model has already memorized!
# This tells us nothing about how it handles new data.
preds = model.predict(X)
# 🔰 NOVICE: Splitting data
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100)
# Split data: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
# ⭐ BEST PRACTICE: Stratify to keep class balance
from sklearn.model_selection import train_test_split
# stratify=y ensures that if y has 90% class A and 10% class B,
# the train and test sets will preserve this ratio.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
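A quick sanity check (illustrative, using a synthetic imbalanced dataset) that `stratify=y` really does preserve the class ratio in both splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~90% class 0 / ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# y.mean() is the share of class 1; both splits keep roughly the same ratio
print(f"Overall minority share: {y.mean():.2f}")
print(f"Train minority share:   {y_tr.mean():.2f}")
print(f"Test minority share:    {y_te.mean():.2f}")
```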
Build your first classifier using the famous Iris dataset:
- Load the Iris dataset using `from sklearn.datasets import load_iris`
- Explore the data: print the feature names and target names
- Split the data into training (80%) and testing (20%) sets
- Train a `KNeighborsClassifier` with `n_neighbors=3`
- Make predictions on the test set and calculate accuracy
- Try different values of `n_neighbors` (1, 5, 7) and compare accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# TODO: Split the data
# TODO: Create and train the model
# TODO: Make predictions and evaluate
Practice data preprocessing with StandardScaler:
- Create a synthetic dataset with features on different scales (e.g., age: 0-100, income: 0-1000000)
- Apply `StandardScaler` to normalize the features
- Compare the mean and standard deviation before and after scaling
- Important: Fit the scaler on training data only, then transform both train and test
- Train a LogisticRegression model with and without scaling, and compare accuracy
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
# Create data with different scales
np.random.seed(42)
age = np.random.randint(18, 80, 100) # Range: 18-80
income = np.random.randint(20000, 200000, 100) # Range: 20k-200k
X = np.column_stack([age, income])
y = (income > 100000).astype(int) # Simple target
# TODO: Split the data
# TODO: Create scaler and fit on X_train only
# TODO: Transform both X_train and X_test
# TODO: Print mean/std before and after scaling
📈 Linear Regression
Predicting a continuous number (like house price).
from sklearn.linear_model import LinearRegression
import numpy as np
# Data: y = 2x
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])
model = LinearRegression()
model.fit(X, y)
# Predict for x = 5
pred = model.predict([[5]])
print(f"Prediction for 5.0: {pred[0]}")
print(f"Coefficient: {model.coef_[0]}")
Build a linear regression model to predict house prices:
- Load the California Housing dataset: `from sklearn.datasets import fetch_california_housing`
- Explore the features (MedInc, HouseAge, AveRooms, etc.)
- Split data into train/test sets (80/20 split)
- Train a `LinearRegression` model
- Calculate Mean Squared Error (MSE) and R² score on test data
- Bonus: Try `Ridge` regression and compare performance
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(f"Features: {housing.feature_names}")
print(f"X shape: {X.shape}")
# TODO: Split the data
# TODO: Train LinearRegression
# TODO: Make predictions and calculate MSE, R²
# TODO: Try Ridge regression
Extend your regression skills with polynomial features:
- Create synthetic data with a non-linear relationship: `y = x² + noise`
- Try fitting a LinearRegression and observe the poor fit
- Use `PolynomialFeatures(degree=2)` to transform your data
- Fit LinearRegression on the polynomial features
- Compare R² scores: linear vs polynomial
- Experiment with degree=3, 4, 5: what happens with too high a degree?
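One possible sketch of the idea behind this exercise (variable names are assumptions, not a reference solution):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)  # y = x² + noise

# A plain line can't follow the parabola, so R² stays near zero
linear = LinearRegression().fit(X, y)
print(f"Linear R²:     {linear.score(X, y):.2f}")

# degree=2 adds an x² column, letting the "linear" model fit the curve
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print(f"Polynomial R²: {poly.score(X_poly, y):.2f}")
```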
🏷️ Classification
Predicting a category (Spam vs Not Spam).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Dummy data
X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_train = [0, 1, 1, 0] # XOR problem
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# "Test" on two of the training points (toy demo only;
# real evaluation needs held-out data, as covered above)
X_test = [[0, 0], [1, 1]]
y_test = [0, 1]
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds)}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, preds))
Understand when accuracy isn't enough by comparing different metrics:
- Create an imbalanced dataset: 90% class 0, 10% class 1
- Train a classifier and calculate Accuracy, Precision, Recall, and F1-score
- Create a "dummy" classifier that always predicts class 0: what's its accuracy?
- Compare all metrics between your model and the dummy classifier
- Understand: When is accuracy misleading? When do we care about Recall vs Precision?
- Use `classification_report()` for a complete metrics summary
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
recall_score, f1_score, classification_report)
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# TODO: Train a classifier
# TODO: Calculate accuracy, precision, recall, F1
# TODO: Compare with a dummy "always predict 0" baseline
# TODO: Print classification_report()
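If the dummy baseline is unclear, here is a sketch using scikit-learn's `DummyClassifier`, a convenience class built for exactly this kind of baseline (the synthetic dataset mirrors the starter code above):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# ~90/10 imbalanced dataset, stratified split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Always predicts the majority class (class 0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
preds = dummy.predict(X_te)

print(f"Accuracy: {accuracy_score(y_te, preds):.2f}")  # high, but misleading
print(f"Recall:   {recall_score(y_te, preds):.2f}")    # 0: misses every positive
```

High accuracy with zero recall is the classic sign that accuracy alone is misleading on imbalanced data.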
Build a complete classification pipeline to predict customer churn:
- Generate synthetic customer data with features: tenure, monthly_charges, total_charges, contract_length
- Create a target variable: churn (1) or stay (0) based on logical rules
- Preprocess: Handle any missing values, scale numerical features
- Split data with stratification (important for imbalanced churn data!)
- Train multiple classifiers: LogisticRegression, RandomForest, KNeighbors
- Compare all three using accuracy, precision, recall, and F1-score
- Identify which model is best for minimizing customer loss (hint: focus on recall for churners)
- Visualize the confusion matrix for your best model
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Generate synthetic churn data
np.random.seed(42)
n_customers = 1000
data = pd.DataFrame({
'tenure': np.random.randint(1, 72, n_customers), # Months
'monthly_charges': np.random.uniform(20, 100, n_customers),
'contract_length': np.random.choice([1, 12, 24], n_customers),
})
data['total_charges'] = data['tenure'] * data['monthly_charges']
# Create churn logic: high charges + short tenure + month-to-month = likely churn
churn_prob = (100 - data['tenure']) / 100 + (data['monthly_charges'] / 200)
churn_prob += (data['contract_length'] == 1) * 0.3
data['churn'] = (churn_prob > np.random.uniform(size=n_customers) + 0.5).astype(int)
# TODO: Prepare features (X) and target (y)
# TODO: Split with stratify=y
# TODO: Scale features
# TODO: Train and compare 3 different classifiers
# TODO: Print classification reports and identify best model
🚀 Pipelines
Chain preprocessing and modeling into a single object. Essential for production.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Create a pipeline that:
# 1. Scales the data (StandardScaler)
# 2. Trains a Logistic Regression model
pipe = make_pipeline(
StandardScaler(),
LogisticRegression()
)
# Now we can treat 'pipe' just like a model
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"Pipeline Score: {score}")
Create a production-ready pipeline that handles everything:
- Load the breast cancer dataset: `from sklearn.datasets import load_breast_cancer`
- Build a pipeline with: `StandardScaler` → `PCA(n_components=10)` → `LogisticRegression`
- Use `Pipeline` with named steps instead of `make_pipeline`
- Train and evaluate the pipeline
- Access individual steps using `pipe.named_steps['scaler']`
- Compare performance with and without PCA dimensionality reduction
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# TODO: Split the data
# TODO: Create Pipeline with named steps
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression())
])
# TODO: Fit and evaluate
# TODO: Access named steps and inspect components
Combine pipelines with cross-validation for robust model evaluation:
- Use `cross_val_score` with your pipeline for 5-fold cross-validation
- Calculate mean and standard deviation of scores across folds
- Use `GridSearchCV` to tune hyperparameters within the pipeline
- Search over: `pca__n_components: [5, 10, 15]` and `classifier__C: [0.1, 1, 10]`
- Print the best parameters and best score
- Understand why preprocessing must be inside the pipeline for proper cross-validation
- Understand why preprocessing must be inside the pipeline for proper cross-validation
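A sketch of these steps, assuming the breast cancer dataset and the step names (`pca`, `classifier`) from the previous exercise:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# 5-fold CV: scaler and PCA are re-fit inside each fold, so no
# information from the validation fold leaks into preprocessing
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Step names become parameter prefixes: '<step>__<param>'
grid = GridSearchCV(pipe, {
    'pca__n_components': [5, 10, 15],
    'classifier__C': [0.1, 1, 10],
}, cv=5)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score:  {grid.best_score_:.3f}")
```

This is why preprocessing belongs inside the pipeline: cross-validation on an externally scaled `X` would let test-fold statistics leak into training.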
📜 ML Cheat Sheet
# ═══ SETUP ═══
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# ═══ MODELS ═══
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model = LinearRegression()
model.fit(X_train, y_train) # Train
preds = model.predict(X_test) # Predict
# ═══ METRICS ═══
from sklearn.metrics import mean_squared_error, accuracy_score
acc = accuracy_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
- Always Split Your Data: Use `train_test_split()` before training. Never evaluate on training data; it yields overoptimistic metrics.
- Scikit-Learn API is Consistent: All models follow `model.fit(X_train, y_train)` → `model.predict(X_test)`. Learn it once, apply everywhere.
- Regression vs. Classification: Regression predicts continuous values (price, temperature). Classification predicts categories (spam/not spam, disease/healthy).
- Evaluate with Metrics: Use MSE for regression, Accuracy/Precision/Recall for classification. Don't trust training accuracy; always test on holdout data.
- Beware Overfitting: Complex models memorize training data but fail on new data. Simpler models often generalize better.
- Features Matter More Than Algorithms: Good feature engineering (selecting, transforming variables) beats fancy algorithms. Garbage in, garbage out.
- Use Pipelines for Production: `Pipeline` bundles preprocessing and model, preventing data leakage and ensuring reproducibility.
- Visualize Predictions: Plot predicted vs. actual values. Visual errors reveal patterns metrics miss (e.g., systematic bias).