🌐 The Internet as Your Dataset
- Understand HTTP requests (GET, POST)
- Fetch JSON data using the `requests` library
- Parse HTML using `BeautifulSoup`
- Handle rate limits and authentication
- Scrape data responsibly and ethically
Parse JSON responses with `response.json()` or `json.loads()`. Most sites publish a robots.txt file (e.g., example.com/robots.txt) specifying which pages can be scraped by bots. Respect it.
🔌 Using APIs
Make HTTP requests with the `requests` library to fetch structured JSON data from APIs. Clean, documented, and respectful.
🕷️ Web Scraping
Parse HTML with BeautifulSoup to extract data from websites without APIs. Powerful but fragile—page changes break your code.
⚖️ Ethics & Legality
Respect robots.txt, rate limits, and terms of service. Scraping without permission can violate laws and get you banned.
Using an API: Walking into a library and asking the librarian for a specific book. They hand you a neat catalog card (JSON) with all the information organized.
Web Scraping: Sneaking into the library's back room and photographing pages from books yourself. You get the data, but it's messy, you might miss context, and you could get kicked out if caught.
Rate Limiting: The librarian says, "You can only ask for 100 books per hour. Come back tomorrow if you need more."
Authentication (API Keys): The library gives you a member card (API key) proving you're authorized to request data.
robots.txt: A sign outside the library saying, "Please don't photograph books from the rare manuscripts section."
API (Application Programming Interface): The "front door" for data. Structured, documented, and authorized.
Web Scraping: The "back door". Extracting data from HTML meant for humans. Fragile but powerful.
Real-World Applications:
- Weather Apps: Fetch current weather from OpenWeatherMap API
- Financial Analysis: Pull stock prices from Yahoo Finance API or scrape market data
- Social Media Monitoring: Collect tweets via Twitter API (now X API) for sentiment analysis
- E-commerce: Scrape competitor prices to dynamically adjust your pricing strategy
Always Try the API First: Before scraping, check if the site has an API. It's faster, more reliable, and legal.
Use Browser DevTools: Open Chrome/Firefox DevTools → Network tab to see API requests websites make internally. Reverse-engineer hidden APIs.
Respect Rate Limits: Add time.sleep() between requests. Getting IP-banned wastes more time than waiting a few seconds.
Save Raw Data First: Save JSON responses or HTML to files before parsing. If parsing fails, you don't have to re-fetch.
Read robots.txt: Visit example.com/robots.txt to see what's allowed. Ignoring it can get you sued or blocked.
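The standard library can do the robots.txt check for you. A minimal offline sketch using `urllib.robotparser` (the robots.txt content and bot name here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as you might fetch from https://example.com/robots.txt
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# can_fetch(user_agent, url) tells you whether a path is allowed
allowed = parser.can_fetch("MyStudentBot", "https://example.com/data")
blocked = parser.can_fetch("MyStudentBot", "https://example.com/private/page")
print(allowed, blocked)  # True False
```

In a real scraper you would call `parser.set_url("https://example.com/robots.txt")` and `parser.read()` to load the live file before checking.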
Open notebook-sessions/week7/session1_apis_scraping.ipynb and summarize API vs scraping with one example each. Note ethical considerations.
🔌 Connecting to APIs
The requests library is the gold standard for HTTP in Python.
# ❌ BAD: Assuming everything works
import requests
import json
# If the internet is down or URL is wrong, this crashes
response = requests.get('https://api.fake-site.com/data')
# If response is 404 (Not Found), this crashes
data = json.loads(response.text)
print(data['results'])
# 🔰 NOVICE: Checking status code
import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'
response = requests.get(url)

if response.status_code == 200:
    # Built-in JSON decoder
    data = response.json()
    print(f"Status Code: {response.status_code}")
    print(data)
else:
    print("Something went wrong")
# ⭐ BEST PRACTICE: Try/Except & raise_for_status
import requests
from requests.exceptions import RequestException

url = 'https://jsonplaceholder.typicode.com/posts/1'

try:
    # Set timeout to prevent hanging forever
    response = requests.get(url, timeout=5)
    # Raises error for 4xx or 5xx status codes
    response.raise_for_status()
    data = response.json()
    print(f"Success: Fetched {len(data) if isinstance(data, list) else 1} posts")
except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Connection Error: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except RequestException as e:
    print(f"Other Request Error: {e}")
Description: Make GET requests to JSONPlaceholder, a free fake API for testing.
- Use `requests.get()` to fetch data from `https://jsonplaceholder.typicode.com/posts`
- Print the first 3 posts with their titles and body text
- Fetch a single user from `/users/1` and display their name, email, and company
- Use `raise_for_status()` to handle potential errors
- Add a timeout of 5 seconds to your requests
Bonus: Use query parameters to filter posts by userId: params={'userId': 1}
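For the bonus, `params={'userId': 1}` tells requests to append a URL-encoded query string. A quick stdlib sketch of what that encoding produces:

```python
from urllib.parse import urlencode

# requests builds the query string from the params dict; same encoding via the stdlib:
base = "https://jsonplaceholder.typicode.com/posts"
query = urlencode({"userId": 1})
full_url = f"{base}?{query}"
print(full_url)  # https://jsonplaceholder.typicode.com/posts?userId=1
```

With requests itself you would simply write `requests.get(base, params={'userId': 1}, timeout=5)` and let the library do this for you.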
Description: Parse nested JSON responses from a weather-like API structure.
- Create a sample nested JSON structure simulating weather data with `location`, `current`, and `forecast` keys
- Extract the current temperature and weather condition from the nested structure
- Loop through a 5-day forecast array and print each day's high and low temperatures
- Handle missing keys gracefully using `.get()` with default values
- Convert the parsed data into a Pandas DataFrame
Bonus: Use json.dumps(data, indent=2) to pretty-print JSON for debugging.
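A minimal sketch of the `.get()` pattern on an invented nested structure (the keys and values here are made up to match the exercise's shape):

```python
import json

# Hypothetical nested structure simulating a weather API response
weather = {
    "location": {"city": "Lisbon"},
    "current": {"temp_c": 21.5, "condition": "Sunny"},
    "forecast": [
        {"day": "Mon", "high": 23, "low": 15},
        {"day": "Tue", "high": 22},          # 'low' missing on purpose
    ],
}

# .get() with a default never raises KeyError
current = weather.get("current", {})
print(current.get("temp_c"), current.get("condition"))

for day in weather.get("forecast", []):
    print(day.get("day"), day.get("high"), day.get("low", "n/a"))

# Pretty-print for debugging (the Bonus tip)
print(json.dumps(weather, indent=2))
```

Chaining `.get("current", {})` before the inner `.get()` means a missing outer key degrades to `None` values instead of crashing.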
Description: Work with authenticated API requests using the GitHub API.
- Create a personal access token on GitHub (Settings → Developer Settings → Tokens)
- Store the token in an environment variable (`os.getenv('GITHUB_TOKEN')`)
- Make an authenticated request to `https://api.github.com/user` using the `Authorization` header
- Fetch your public repositories and display their names and star counts
- Handle 401 (unauthorized) and 403 (rate limit) errors appropriately
Bonus: Check remaining rate limits using response.headers['X-RateLimit-Remaining'].
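A sketch of building the authenticated headers without sending anything (no network call is made; the token value is whatever your environment provides):

```python
import os

# Hypothetical: the token lives in the GITHUB_TOKEN environment variable
token = os.getenv("GITHUB_TOKEN", "missing-token")

# GitHub's REST API accepts the token via the Authorization header
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.github+json",
}

# An authenticated call would then look like:
# response = requests.get("https://api.github.com/user", headers=headers, timeout=10)
# if response.status_code in (401, 403): handle auth / rate-limit errors here
print(headers["Authorization"].startswith("Bearer "))  # True
```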
- Ignoring Status Codes: Always check `response.status_code`. A 200 means success, but 401 (unauthorized), 429 (rate limit), or 500 (server error) require different handling.
- Not Handling Timeouts: Network requests can hang forever. Always set `timeout=10` in `requests.get()` to prevent infinite waits.
- Hardcoding API Keys: Never commit API keys to GitHub. Use environment variables (`os.getenv('API_KEY')`) or config files in `.gitignore`.
- Exceeding Rate Limits: APIs limit requests per time period. Add `time.sleep()` between requests or use libraries like `ratelimit` to stay compliant.
- Why do companies provide free APIs? What are the business reasons behind rate limits and API keys?
- What's the difference between authentication (proving who you are) and authorization (what you're allowed to do)? How do API keys relate?
- When would you use GET vs. POST requests? Give examples of data operations that require each.
- How can you handle API responses when the service is temporarily down (e.g., 503 errors)? What retry strategies make sense?
Challenge: Build a weather data fetcher using OpenWeatherMap API (free tier):
- Sign up for an API key at openweathermap.org
- Write a function that takes a city name and returns current temperature, humidity, and weather description
- Handle errors (invalid city, network timeout, API key issues)
- Fetch weather for 5 different cities and save results to a CSV file
Bonus: Use requests.Session() for connection pooling when making multiple requests.
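For the Session bonus: a `Session` keeps TCP connections open between requests (connection pooling) and shares defaults like headers. A minimal sketch that configures a session without sending anything:

```python
import requests

# A Session reuses TCP connections and shares defaults across calls
session = requests.Session()
session.headers.update({"User-Agent": "MyStudentBot/1.0 (education purposes)"})

# Each call through the session reuses the pooled connection, e.g.:
# resp = session.get("https://jsonplaceholder.typicode.com/posts/1", timeout=5)
print(session.headers["User-Agent"])
```

For five city lookups against the same host, the pooled connection avoids repeating the TCP/TLS handshake on every request.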
Open notebook-sessions/week7/session1_apis_scraping.ipynb and call an open API (e.g., JSONPlaceholder). Use raise_for_status() and handle Timeout.
🕸️ Web Scraping
When there is no API, use BeautifulSoup to parse HTML.
from bs4 import BeautifulSoup
# Simulate HTML content
html_doc = """
<html>
<head><title>My Blog</title></head>
<body>
<h1 class="headline">Welcome to Python</h1>
<p class="content">Scraping is fun!</p>
<a href="http://example.com" id="link1">Link</a>
</body>
</html>
"""
# Parse HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# Extract data
title = soup.title.string
headline = soup.find('h1', class_='headline').text
link = soup.find('a')['href']
print(f"Title: {title}")
print(f"Headline: {headline}")
print(f"Link: {link}")
Description: Build a BeautifulSoup scraper to extract headlines from a practice news site.
- Target `https://quotes.toscrape.com` (a safe practice scraping site)
- Extract all quotes, their authors, and associated tags from the homepage
- Use `find_all()` to locate all quote containers
- Handle missing tags gracefully (check for `None` before accessing `.text`)
- Store results in a list of dictionaries with keys: `quote`, `author`, `tags`
- Export the scraped data to a CSV file using Pandas
Advanced Challenge:
- Implement pagination by finding and following the "Next" button link
- Scrape all 10 pages of quotes
- Add `time.sleep(1)` between page requests to be polite
- Create a function `scrape_page(url)` that returns a list of quotes from a single page
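A sketch of the page-scraping half of that function, run on a made-up HTML snippet that mimics one quote box (the real function would take a URL, fetch it with requests, and sleep between pages):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one quote box on quotes.toscrape.com
sample_html = """
<div class="quote">
  <span class="text">Simplicity is the soul of efficiency.</span>
  <small class="author">Austin Freeman</small>
</div>
"""

def scrape_page(html):
    """Return a list of {'quote', 'author'} dicts from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for box in soup.find_all("div", class_="quote"):
        text = box.find("span", class_="text")
        author = box.find("small", class_="author")
        results.append({
            "quote": text.text if text else None,     # guard against missing tags
            "author": author.text if author else None,
        })
    return results

quotes = scrape_page(sample_html)
print(quotes)
```

The `if text else None` guards are the defensive coding the pitfalls below call for: a missing tag yields `None` instead of an `AttributeError`.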
- Brittle Selectors: CSS selectors like `.class-name` break when websites redesign. Your scraper needs constant maintenance—expect it.
- Ignoring robots.txt: Check `website.com/robots.txt` before scraping. Violating it can lead to legal issues or permanent IP bans.
- No Rate Limiting: Hammering a server with requests crashes it and gets you blocked. Add `time.sleep(1)` between requests as a courtesy.
- Not Handling Missing Elements: Use `.find()` and check for `None` before accessing `.text`. Missing elements crash your script without defensive coding.
- Why is web scraping considered "fragile"? What happens when a website redesigns its HTML structure?
- How does BeautifulSoup's `find()` differ from `find_all()`? When would you use each?
- What strategies can you use to make scrapers more resilient to HTML changes (e.g., multiple fallback selectors)?
- Why is it important to add delays between scraping requests? What's a reasonable delay for most websites?
Challenge: Scrape headlines from a news website (choose a simple one like quotes.toscrape.com for practice):
- Fetch the homepage HTML using `requests.get()`
- Parse the HTML with BeautifulSoup
- Extract all article titles, authors, and timestamps
- Save results to a Pandas DataFrame and export to CSV
- Handle missing data gracefully (some articles may lack authors)
Bonus: Scrape multiple pages by following "Next Page" links in a loop.
Open notebook-sessions/week7/session1_apis_scraping.ipynb and parse simple HTML with BeautifulSoup. Extract title and links. Save HTML locally first.
⚖️ Ethics & Best Practices
Just because you can scrape it, doesn't mean you should.
- Check `robots.txt` (e.g., google.com/robots.txt) to see what is allowed.
- Rate Limiting: Don't hammer the server. Sleep between requests (`time.sleep(1)`).
- User-Agent: Identify your bot script in the headers.
- Public Data Only: Do not scrape data behind a login unless authorized.
import time
import requests
headers = {
    'User-Agent': 'MyStudentBot/1.0 (education purposes)'
}

urls = ['http://site.com/page1', 'http://site.com/page2']

for url in urls:
    try:
        # Be polite! Identify yourself, and set a timeout
        response = requests.get(url, headers=headers, timeout=10)
        # ... process data ...
        # Be polite! Wait between requests
        time.sleep(1)
    except Exception as e:
        print(e)
Description: Build a production-ready API client that handles errors, timeouts, and rate limits gracefully.
- Create a function `fetch_with_retry(url, max_retries=3)` that retries failed requests
- Implement exponential backoff: wait 1s, 2s, 4s between retries
- Handle specific HTTP status codes:
  - 429 Too Many Requests: Wait and retry based on the `Retry-After` header
  - 500-503 Server Errors: Retry with backoff
  - 401/403 Auth Errors: Log error and don't retry
- Add timeout handling with `timeout=10` seconds
- Log all attempts and failures for debugging
Bonus: Use the tenacity library for advanced retry logic with decorators.
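The backoff loop can be sketched like this. Note this sketch takes a callable rather than a URL (so it can be demonstrated offline with a fake fetcher); the exercise's `fetch_with_retry(url, ...)` signature would wrap `requests.get` inside:

```python
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Retry fetch() with exponential backoff: base_delay, 2x, 4x, ..."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise                                 # out of retries: re-raise
            delay = base_delay * (2 ** attempt)       # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Fake fetcher: fails twice, then succeeds
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return {"ok": True}

result = fetch_with_retry(flaky, base_delay=0.01)
print(result)  # {'ok': True}
```

Status-code-specific handling (429 vs. 401/403) would go inside the `except` block, deciding whether the error is worth retrying at all.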
Description: Create a complete data pipeline that respects API rate limits.
- Build a pipeline that fetches 100 posts from JSONPlaceholder API
- Implement a rate limiter: maximum 10 requests per minute
- Use `requests.Session()` for connection pooling
- Add logging to track request timing and any errors
- Save progress to a checkpoint file so you can resume if interrupted
- Export final results to both JSON and CSV formats
Advanced: Use asyncio with aiohttp for concurrent requests while still respecting rate limits.
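One way to sketch the "10 requests per minute" limiter is a sliding window of timestamps (a minimal illustration, not production-grade):

```python
import time

class RateLimiter:
    """Allow at most max_calls per period seconds (sliding-window sketch)."""
    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = []                        # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Keep only timestamps inside the current window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# 10 requests per minute, as the exercise asks
limiter = RateLimiter(max_calls=10, period=60)
limiter.wait()                                 # call before every API request
print(len(limiter.calls))                      # 1
```

In the pipeline, `limiter.wait()` goes immediately before each `session.get(...)` call, so bursts are smoothed automatically.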
- Ignoring Terms of Service: Most websites' ToS explicitly forbid scraping. Violating ToS can lead to legal action—read them before scraping.
- Scraping Personal Data: GDPR and CCPA regulate personal data collection. Scraping user profiles, emails, or private info can violate privacy laws.
- Not Respecting robots.txt: This file specifies what bots can access. Ignoring it is unethical and may be illegal under CFAA (Computer Fraud and Abuse Act).
- Overwhelming Small Servers: Small websites can't handle thousands of requests per second. Your scraper could cause a denial-of-service (DoS)—rate limit yourself.
- Where is the ethical line between "public data" and "data you shouldn't scrape"? Is scraping social media profiles ethical?
- What legal precedents exist around web scraping? Research the hiQ vs. LinkedIn case and its implications.
- How do you balance "data is publicly visible" with "the website owner doesn't want you scraping it"?
- What's the difference between scraping for personal research vs. commercial use? Why does intent matter legally?
Open notebook-sessions/week7/session2_apis_scraping_group.ipynb and implement polite scraping with headers, sleep, and robots.txt checks.
📜 Scraping Cheat Sheet
# ═══ REQUESTS ═══
import requests
resp = requests.get(url, headers=headers, params={'q': 'search'})
data = resp.json() # Parse JSON response
html = resp.text # Get raw text/HTML
resp.raise_for_status() # Check for errors
# ═══ BEAUTIFUL SOUP ═══
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
el = soup.find('div', id='main') # Find one
els = soup.find_all('a', class_='link') # Find all
text = el.text.strip() # Get text content
href = el['href'] # Get attribute
- Try APIs First: Always check if a website offers an API before scraping. APIs are faster, more reliable, and legally safer.
- HTTP Status Codes Matter: Check `response.status_code`—200 is success, 401 is unauthorized, 429 is rate limit, 500 is server error.
- Secure Your API Keys: Never hardcode keys in code or commit them to GitHub. Use environment variables with `os.getenv()`.
- Rate Limit Yourself: Add `time.sleep()` between requests. Exceeding limits gets you IP-banned and is disrespectful to servers.
- Web Scraping is Fragile: HTML changes break scrapers. Expect constant maintenance and use defensive coding (check for `None`).
- Respect robots.txt: Check `website.com/robots.txt` to see what's allowed. Ignoring it can lead to legal consequences.
- Ethics are Critical: Scraping personal data or violating ToS can violate laws (GDPR, CCPA, CFAA). Always research legality first.
- Save Raw Data First: Store JSON/HTML before parsing. If parsing fails, you won't need to re-fetch, saving time and server load.