🌐 The Internet as Your Dataset

🎯 Learning Objectives
  • Understand HTTP requests (GET, POST)
  • Fetch JSON data using the `requests` library
  • Parse HTML using `BeautifulSoup`
  • Handle rate limits and authentication
  • Scrape data responsibly and ethically
📚 Key Vocabulary
  • API: Application Programming Interface, a structured way to request data from servers, usually returning JSON or XML. The official, documented way to access data.
  • HTTP: Hypertext Transfer Protocol, the foundation of web communication. GET retrieves data, POST sends data, PUT updates, DELETE removes.
  • JSON: JavaScript Object Notation, a lightweight data format using key-value pairs, easily converted to Python dictionaries with json.loads().
  • Web Scraping: Extracting data from HTML web pages not designed for data access. Uses tools like BeautifulSoup to parse HTML structure.
  • Rate Limiting: Restrictions on how many API requests you can make per time period (e.g., 100 requests/hour). Exceeding limits can get you blocked.
  • robots.txt: A file at a website's root (e.g., example.com/robots.txt) specifying which pages bots may access. Respect it.
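
The JSON entry above in miniature: `json.loads()` turns JSON text into Python objects.

```python
import json

# A JSON string, as you might receive in an API response body
raw = '{"city": "Paris", "temp_c": 21.5, "tags": ["sunny", "mild"]}'

# json.loads() converts JSON text into Python objects (dict, list, str, ...)
data = json.loads(raw)

print(data['city'])     # Paris
print(data['tags'][0])  # sunny
```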

🔌 Using APIs

Make HTTP requests with the `requests` library to fetch structured JSON data from APIs. Clean, documented, and respectful.

🕷️ Web Scraping

Parse HTML with BeautifulSoup to extract data from websites without APIs. Powerful but fragile—page changes break your code.

⚖️ Ethics & Legality

Respect robots.txt, rate limits, and terms of service. Scraping without permission can violate laws and get you banned.

🎯 Analogy: API vs Scraping

Using an API: Walking into a library and asking the librarian for a specific book. They hand you a neat catalog card (JSON) with all the information organized.

Web Scraping: Sneaking into the library's back room and photographing pages from books yourself. You get the data, but it's messy, you might miss context, and you could get kicked out if caught.

Rate Limiting: The librarian says, "You can only ask for 100 books per hour. Come back tomorrow if you need more."

Authentication (API Keys): The library gives you a member card (API key) proving you're authorized to request data.

robots.txt: A sign outside the library saying, "Please don't photograph books from the rare manuscripts section."

💡 API vs Scraping

API (Application Programming Interface): The "front door" for data. Structured, documented, and authorized.

Web Scraping: The "back door". Extracting data from HTML meant for humans. Fragile but powerful.

Real-World Applications:

  • Weather Apps: Fetch current weather from OpenWeatherMap API
  • Financial Analysis: Pull stock prices from Yahoo Finance API or scrape market data
  • Social Media Monitoring: Collect tweets via Twitter API (now X API) for sentiment analysis
  • E-commerce: Scrape competitor prices to dynamically adjust your pricing strategy
💡 Learning Strategy: APIs & Scraping

Always Try the API First: Before scraping, check if the site has an API. It's faster, more reliable, and legal.

Use Browser DevTools: Open Chrome/Firefox DevTools → Network tab to see API requests websites make internally. Reverse-engineer hidden APIs.

Respect Rate Limits: Add time.sleep() between requests. Getting IP-banned wastes more time than waiting a few seconds.

Save Raw Data First: Save JSON responses or HTML to files before parsing. If parsing fails, you don't have to re-fetch.

Read robots.txt: Visit example.com/robots.txt to see what's allowed. Ignoring it can get you sued or blocked.
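
The "save raw data first" tip can be sketched like this (the JSON string below stands in for a fetched `response.text`):

```python
import json

# Stand-in for response.text from an earlier fetch
raw = '{"id": 1, "title": "First post"}'

# 1) Save the raw payload to disk before doing anything else
with open('raw_post.json', 'w', encoding='utf-8') as f:
    f.write(raw)

# 2) Parse from the file; if parsing logic changes, no re-fetch is needed
with open('raw_post.json', encoding='utf-8') as f:
    post = json.loads(f.read())

print(post['title'])  # First post
```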

📓 Practice in Notebook

Open notebook-sessions/week7/session1_apis_scraping.ipynb and summarize API vs scraping with one example each. Note ethical considerations.

🔌 Connecting to APIs

The requests library is the gold standard for HTTP in Python.

Python ❌ Bad Practice No Error Handling
# ❌ BAD: Assuming everything works
import requests
import json

# If the internet is down or URL is wrong, this crashes
response = requests.get('https://api.fake-site.com/data')

# If the body isn't valid JSON (e.g., an HTML 404 error page), this crashes
data = json.loads(response.text)

print(data['results'])
Python 🔰 Novice Basic Check
# 🔰 NOVICE: Checking status code
import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'
response = requests.get(url)

if response.status_code == 200:
    # Built-in JSON decoder
    data = response.json()
    print(f"Status Code: {response.status_code}")
    print(data)
else:
    print(f"Something went wrong: {response.status_code}")
Python ⭐ Best Practice Robust Handling
# ⭐ BEST PRACTICE: Try/Except & raise_for_status
import requests
from requests.exceptions import RequestException

url = 'https://jsonplaceholder.typicode.com/posts/1'

try:
    # Set timeout to prevent hanging forever
    response = requests.get(url, timeout=5)
    
    # Raises error for 4xx or 5xx status codes
    response.raise_for_status()
    
    data = response.json()
    print(f"Success: Fetched {len(data) if isinstance(data, list) else 1} posts")

except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Connection Error: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except RequestException as e:
    print(f"Other request error: {e}")
🏋️ Exercise: API Explorer

Description: Make GET requests to JSONPlaceholder, a free fake API for testing.

  • Use requests.get() to fetch data from https://jsonplaceholder.typicode.com/posts
  • Print the first 3 posts with their titles and body text
  • Fetch a single user from /users/1 and display their name, email, and company
  • Use raise_for_status() to handle potential errors
  • Add a timeout of 5 seconds to your requests

Bonus: Use query parameters to filter posts by userId: params={'userId': 1}
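
A hint for the bonus: `requests.Request(...).prepare()` builds a request without sending it, so you can inspect how `params` are encoded into the final URL offline.

```python
import requests

# Build (but don't send) the request to see how params are encoded
req = requests.Request(
    'GET',
    'https://jsonplaceholder.typicode.com/posts',
    params={'userId': 1},
).prepare()

print(req.url)  # https://jsonplaceholder.typicode.com/posts?userId=1
```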

🏋️ Exercise: Weather Data Parser

Description: Parse nested JSON responses from a weather-like API structure.

  • Create a sample nested JSON structure simulating weather data with location, current, and forecast keys
  • Extract the current temperature and weather condition from the nested structure
  • Loop through a 5-day forecast array and print each day's high and low temperatures
  • Handle missing keys gracefully using .get() with default values
  • Convert the parsed data into a Pandas DataFrame

Bonus: Use json.dumps(data, indent=2) to pretty-print JSON for debugging.
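
The defensive `.get()` pattern from the exercise, shown on a tiny stand-in structure (the keys here are illustrative, not a real API schema):

```python
# Minimal stand-in for a nested weather response
weather = {'location': {'city': 'Oslo'}, 'current': {'temp_c': 3}}

# .get() returns a default instead of raising KeyError on missing keys
temp = weather.get('current', {}).get('temp_c')
condition = weather.get('current', {}).get('condition', 'unknown')

print(temp, condition)  # 3 unknown
```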

🏋️ Exercise: GitHub API

Description: Work with authenticated API requests using the GitHub API.

  • Create a personal access token on GitHub (Settings → Developer Settings → Tokens)
  • Store the token in an environment variable (os.getenv('GITHUB_TOKEN'))
  • Make an authenticated request to https://api.github.com/user using the Authorization header
  • Fetch your public repositories and display their names and star counts
  • Handle 401 (unauthorized) and 403 (rate limit) errors appropriately

Bonus: Check remaining rate limits using response.headers['X-RateLimit-Remaining'].
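
A sketch of the authenticated request, assuming a token stored in the GITHUB_TOKEN environment variable; the network call only runs when the token is set.

```python
import os
import requests

# Never hardcode tokens; read them from the environment
token = os.getenv('GITHUB_TOKEN')

headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/vnd.github+json',
}

if token is None:
    print("Set GITHUB_TOKEN to run the authenticated request.")
else:
    resp = requests.get('https://api.github.com/user',
                        headers=headers, timeout=10)
    if resp.status_code == 401:
        print("Unauthorized: check your token.")
    elif resp.status_code == 403:
        print("Forbidden: possibly rate-limited.")
    else:
        resp.raise_for_status()
        print(resp.json()['login'])
```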

⚠️ Common Pitfalls: APIs
  • Ignoring Status Codes: Always check response.status_code. A 200 means success, but 401 (unauthorized), 429 (rate limit), or 500 (server error) require different handling.
  • Not Handling Timeouts: Network requests can hang forever. Always set timeout=10 in requests.get() to prevent infinite waits.
  • Hardcoding API Keys: Never commit API keys to GitHub. Use environment variables (os.getenv('API_KEY')) or config files in .gitignore.
  • Exceeding Rate Limits: APIs limit requests per time period. Add time.sleep() between requests or use libraries like ratelimit to stay compliant.
💬 Discussion Questions
  1. Why do companies provide free APIs? What are the business reasons behind rate limits and API keys?
  2. What's the difference between authentication (proving who you are) and authorization (what you're allowed to do)? How do API keys relate?
  3. When would you use GET vs. POST requests? Give examples of data operations that require each.
  4. How can you handle API responses when the service is temporarily down (e.g., 503 errors)? What retry strategies make sense?
🎮 Practice Playground: Weather Dashboard

Challenge: Build a weather data fetcher using OpenWeatherMap API (free tier):

  • Sign up for an API key at openweathermap.org
  • Write a function that takes a city name and returns current temperature, humidity, and weather description
  • Handle errors (invalid city, network timeout, API key issues)
  • Fetch weather for 5 different cities and save results to a CSV file

Bonus: Use requests.Session() for connection pooling when making multiple requests.
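
The `requests.Session()` bonus looks like this: a session reuses pooled connections and carries shared headers across requests. The URL list is left empty here as a placeholder.

```python
import requests

session = requests.Session()
# Headers set on the session apply to every request it makes
session.headers.update({'User-Agent': 'WeatherDashboard/1.0'})

urls = []  # e.g. one API URL per city; left empty in this sketch
results = []
for url in urls:
    resp = session.get(url, timeout=5)  # reuses the pooled connection
    resp.raise_for_status()
    results.append(resp.json())
```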

📓 Practice in Notebook

Open notebook-sessions/week7/session1_apis_scraping.ipynb and call an open API (e.g., JSONPlaceholder). Use raise_for_status() and handle Timeout.

🕸️ Web Scraping

When there is no API, use BeautifulSoup to parse HTML.

Python ⭐ Beautiful Soup
from bs4 import BeautifulSoup

# Simulate HTML content
html_doc = """
<html>
    <head><title>My Blog</title></head>
    <body>
        <h1 class="headline">Welcome to Python</h1>
        <p class="content">Scraping is fun!</p>
        <a href="http://example.com" id="link1">Link</a>
    </body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract data
title = soup.title.string
headline = soup.find('h1', class_='headline').text
link = soup.find('a')['href']

print(f"Title: {title}")
print(f"Headline: {headline}")
print(f"Link: {link}")
🚀 Challenge: News Headlines Scraper

Description: Build a BeautifulSoup scraper to extract headlines from a practice news site.

  • Target https://quotes.toscrape.com (a safe practice scraping site)
  • Extract all quotes, their authors, and associated tags from the homepage
  • Use find_all() to locate all quote containers
  • Handle missing tags gracefully (check for None before accessing .text)
  • Store results in a list of dictionaries with keys: quote, author, tags
  • Export the scraped data to a CSV file using Pandas

Advanced Challenge:

  • Implement pagination by finding and following the "Next" button link
  • Scrape all 10 pages of quotes
  • Add time.sleep(1) between page requests to be polite
  • Create a function scrape_page(url) that returns a list of quotes from a single page
⚠️ Common Pitfalls: Web Scraping
  • Brittle Selectors: CSS selectors like .class-name break when websites redesign. Your scraper needs constant maintenance—expect it.
  • Ignoring robots.txt: Check website.com/robots.txt before scraping. Violating it can lead to legal issues or permanent IP bans.
  • No Rate Limiting: Hammering a server with requests crashes it and gets you blocked. Add time.sleep(1) between requests as a courtesy.
  • Not Handling Missing Elements: Use .find() and check for None before accessing .text. Missing elements crash your script without defensive coding.
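
The "check for None" pitfall in miniature; the HTML below deliberately lacks the author element.

```python
from bs4 import BeautifulSoup

html = '<div class="quote"><span class="text">Hello</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# .find() returns None when the element is missing
author_el = soup.find('small', class_='author')
author = author_el.text if author_el else 'Unknown'

text_el = soup.find('span', class_='text')
text = text_el.text if text_el else ''

print(text, '-', author)  # Hello - Unknown
```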
💬 Discussion Questions
  1. Why is web scraping considered "fragile"? What happens when a website redesigns its HTML structure?
  2. How does BeautifulSoup's find() differ from find_all()? When would you use each?
  3. What strategies can you use to make scrapers more resilient to HTML changes (e.g., multiple fallback selectors)?
  4. Why is it important to add delays between scraping requests? What's a reasonable delay for most websites?
🎮 Practice Playground: News Aggregator

Challenge: Scrape headlines from a news website (choose a simple one like quotes.toscrape.com for practice):

  • Fetch the homepage HTML using requests.get()
  • Parse the HTML with BeautifulSoup
  • Extract all article titles, authors, and timestamps
  • Save results to a Pandas DataFrame and export to CSV
  • Handle missing data gracefully (some articles may lack authors)

Bonus: Scrape multiple pages by following "Next Page" links in a loop.

📓 Practice in Notebook

Open notebook-sessions/week7/session1_apis_scraping.ipynb and parse simple HTML with BeautifulSoup. Extract title and links. Save HTML locally first.

⚖️ Ethics & Best Practices

Just because you can scrape it, doesn't mean you should.

⚠️ The Rules of the Road
  • Check robots.txt (e.g., google.com/robots.txt) to see what is allowed.
  • Rate Limiting: Don't hammer the server. Sleep between requests (`time.sleep(1)`).
  • User-Agent: Identify your bot script in the headers.
  • Public Data Only: Do not scrape data behind a login unless authorized.
Python ⭐ Polite Scraper
import time
import requests

headers = {
    'User-Agent': 'MyStudentBot/1.0 (education purposes)'
}

urls = ['http://site.com/page1', 'http://site.com/page2']

for url in urls:
    try:
        # Be polite! Identify yourself and don't hang forever
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()
        # ... process data ...
        
        # Be polite! Wait between requests
        time.sleep(1)
        
    except requests.exceptions.RequestException as e:
        print(f"Request failed for {url}: {e}")
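
Checking robots.txt can be automated with the standard library's urllib.robotparser; here the rules are parsed from an inline example rather than fetched from a live site.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally: rp.set_url('https://site.com/robots.txt'); rp.read()
# Here we parse example rules directly to stay offline
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('MyStudentBot', 'https://site.com/private/page'))  # False
print(rp.can_fetch('MyStudentBot', 'https://site.com/index.html'))    # True
```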
🏋️ Exercise: Robust API Client

Description: Build a production-ready API client that handles errors, timeouts, and rate limits gracefully.

  • Create a function fetch_with_retry(url, max_retries=3) that retries failed requests
  • Implement exponential backoff: wait 1s, 2s, 4s between retries
  • Handle specific HTTP status codes:
    • 429 Too Many Requests: Wait and retry based on Retry-After header
    • 500-503 Server Errors: Retry with backoff
    • 401/403 Auth Errors: Log error and don't retry
  • Add timeout handling with timeout=10 seconds
  • Log all attempts and failures for debugging

Bonus: Use the tenacity library for advanced retry logic with decorators.
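
One way the retry logic above can be sketched (exponential backoff of 1s, 2s, 4s; a production client should also honour the Retry-After header on 429 responses):

```python
import time
import requests

def fetch_with_retry(url, max_retries=3):
    """Retry transient failures, waiting 1s, 2s, 4s between attempts."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code in (429, 500, 502, 503):
                time.sleep(2 ** attempt)   # transient: back off, then retry
                continue
            resp.raise_for_status()        # 401/403/404 etc.: raise, no retry
            return resp
        except requests.exceptions.HTTPError:
            raise                          # auth/client errors: don't retry
        except requests.exceptions.RequestException:
            time.sleep(2 ** attempt)       # timeout / connection error: retry
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```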

🚀 Challenge: Rate-Limited Data Pipeline

Description: Create a complete data pipeline that respects API rate limits.

  • Build a pipeline that fetches 100 posts from JSONPlaceholder API
  • Implement a rate limiter: maximum 10 requests per minute
  • Use requests.Session() for connection pooling
  • Add logging to track request timing and any errors
  • Save progress to a checkpoint file so you can resume if interrupted
  • Export final results to both JSON and CSV formats

Advanced: Use asyncio with aiohttp for concurrent requests while still respecting rate limits.

⚠️ Common Pitfalls: Ethics & Legality
  • Ignoring Terms of Service: Most websites' ToS explicitly forbid scraping. Violating ToS can lead to legal action—read them before scraping.
  • Scraping Personal Data: GDPR and CCPA regulate personal data collection. Scraping user profiles, emails, or private info can violate privacy laws.
  • Not Respecting robots.txt: This file specifies what bots can access. Ignoring it is unethical and may be illegal under CFAA (Computer Fraud and Abuse Act).
  • Overwhelming Small Servers: Small websites can't handle thousands of requests per second. Your scraper could cause a denial-of-service (DoS)—rate limit yourself.
💬 Discussion Questions
  1. Where is the ethical line between "public data" and "data you shouldn't scrape"? Is scraping social media profiles ethical?
  2. What legal precedents exist around web scraping? Research the hiQ vs. LinkedIn case and its implications.
  3. How do you balance "data is publicly visible" with "the website owner doesn't want you scraping it"?
  4. What's the difference between scraping for personal research vs. commercial use? Why does intent matter legally?
📓 Practice in Notebook

Open notebook-sessions/week7/session2_apis_scraping_group.ipynb and implement polite scraping with headers, sleep, and robots.txt checks.

📜 Scraping Cheat Sheet

Patterns Reference
# ═══ REQUESTS ═══
import requests
resp = requests.get(url, headers=headers, params={'q': 'search'})
resp.raise_for_status()     # Raise for 4xx/5xx errors
data = resp.json()          # Parse JSON response
html = resp.text            # Get raw text/HTML

# ═══ BEAUTIFUL SOUP ═══
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

el = soup.find('div', id='main')      # Find one
els = soup.find_all('a', class_='link') # Find all

text = el.text.strip()      # Get text content
href = el['href']           # Get attribute
🧠 Quick Quiz: Week 7 Concepts

1) Which method raises an HTTP error for non-2xx codes?

2) Which BeautifulSoup parser argument is used in examples?

3) Which header identifies your bot?

🎯 Key Takeaways: APIs & Web Scraping
  • Try APIs First: Always check if a website offers an API before scraping. APIs are faster, more reliable, and legally safer.
  • HTTP Status Codes Matter: Check response.status_code—200 is success, 401 is unauthorized, 429 is rate limit, 500 is server error.
  • Secure Your API Keys: Never hardcode keys in code or commit them to GitHub. Use environment variables with os.getenv().
  • Rate Limit Yourself: Add time.sleep() between requests. Exceeding limits gets you IP-banned and is disrespectful to servers.
  • Web Scraping is Fragile: HTML changes break scrapers. Expect constant maintenance and use defensive coding (check for None).
  • Respect robots.txt: Check website.com/robots.txt to see what's allowed. Ignoring it can lead to legal consequences.
  • Ethics are Critical: Scraping personal data or violating ToS can violate laws (GDPR, CCPA, CFAA). Always research legality first.
  • Save Raw Data First: Store JSON/HTML before parsing. If parsing fails, you won't need to re-fetch, saving time and server load.