insta_scraper/CLAUDE.md

Instagram Scraper — Agent Context

What this project is

A Playwright-based Instagram scraper. Reads profiles.txt, visits each profile in a real Chromium browser, scrapes the last N posts, and writes output.csv + output.md.

How to run

uv run python scraper.py        # run the scraper
uv run pytest tests/ -v         # run unit tests
uv run playwright install chromium  # install browser (first time only)

Always use uv run — never python directly or pip.

Key files

File                     Purpose
scraper.py               All logic: auth, profile scraping, post scraping, output
profiles.txt             Input: one Instagram URL per line
auth_state.json          Saved Playwright session (gitignored, created on first run)
output.csv               Scraped results (gitignored)
output.md                Scraped results in Markdown (gitignored)
tests/test_parsers.py    Unit tests for pure parsing functions

Architecture

Single script, no framework. Key functions in scraper.py:

  • extract_hashtags(text) / extract_mentions(text) — pure regex, fully tested
  • profile_slug_from_url(url) — extracts username from URL
  • read_profiles() — reads profiles.txt, strips line number prefixes
  • is_logged_in(page) / ensure_authenticated(browser) — session management
  • get_post_urls(page, profile_url) — collects post links from profile grid
  • scrape_post(page, post_url, profile_slug) — extracts all fields from a post
  • write_csv(posts) / write_markdown(posts) — output writers
  • main() — orchestrates everything
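
The pure helpers are small regex functions; a minimal sketch of plausible implementations (the exact regexes in scraper.py may differ):

```python
import re

def extract_hashtags(text: str) -> list[str]:
    """Return hashtag names (without '#') in order of appearance."""
    return re.findall(r"#(\w+)", text or "")

def extract_mentions(text: str) -> list[str]:
    """Return mentioned usernames (without '@'); Instagram handles may contain dots."""
    return re.findall(r"@([\w.]+)", text or "")

def profile_slug_from_url(url: str) -> str:
    """Extract the username segment from an Instagram profile URL."""
    return url.rstrip("/").split("/")[-1].split("?")[0]
```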

Output fields

profile, post_url, date (ISO 8601), caption, likes, image_urls (comma-separated CDN URLs), hashtags, mentions, location, media_type (photo/video/carousel)
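
Viewed as a record, one plausible schema for a scraped post (field names from the list above; the dataclass itself and the types are assumptions, not necessarily how scraper.py represents rows):

```python
from dataclasses import dataclass, asdict

@dataclass
class Post:
    profile: str
    post_url: str
    date: str        # ISO 8601
    caption: str
    likes: int
    image_urls: str  # comma-separated CDN URLs
    hashtags: str
    mentions: str
    location: str
    media_type: str  # photo / video / carousel
```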

DOM selectors (verified against live Instagram)

Instagram uses atomic CSS — class names change. These structural selectors are stable:

  • Post grid links: a[href*="/p/"]
  • Date: time[datetime].get_attribute('datetime') (first element = post date)
  • Caption: JS tree walker over text nodes in section containing a link to the profile slug; filter len > 20, skip relative times
  • Likes: First span with purely numeric text
  • Images: JS img[src*="cdninstagram"] filtered by width > 100 and excluding /s150x150/
  • Carousel: button[aria-label="Next"] exists
  • Video: video element exists
  • Location: a[href*="/explore/locations/"] — filter out text "Locations" (footer link)
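
The image-filter rule above can be expressed as a pure predicate (hypothetical helper name, not necessarily how scraper.py structures it):

```python
def keep_image(src: str, width: int) -> bool:
    """True for full-size CDN images: right host, real width, not a 150px thumbnail."""
    return (
        "cdninstagram" in src
        and width > 100
        and "/s150x150/" not in src
    )
```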

Auth flow

  1. Check for auth_state.json → load with browser.new_context(storage_state=...)
  2. If missing or expired → open visible browser, wait for user to log in manually, save with context.storage_state(path="auth_state.json")
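
A sketch of both steps, assuming the Playwright sync API and a Browser object passed in (the prompt text and exact control flow in scraper.py may differ):

```python
from pathlib import Path

AUTH_STATE = Path("auth_state.json")

def ensure_authenticated(browser):
    """Step 1: reuse the saved session; step 2: fall back to a manual login."""
    if AUTH_STATE.exists():
        return browser.new_context(storage_state=str(AUTH_STATE))
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.instagram.com/")
    input("Log in in the browser window, then press Enter... ")
    context.storage_state(path=str(AUTH_STATE))
    return context
```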

Modifying behaviour

  • Change number of posts: edit POSTS_PER_PROFILE constant at top of scraper.py
  • Change input/output paths: edit PROFILES_FILE, OUTPUT_CSV, OUTPUT_MD constants
  • To re-authenticate: delete auth_state.json and re-run
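
The constants named above sit at the top of scraper.py; the values shown here are illustrative defaults, not what the file actually contains:

```python
# Tunables at the top of scraper.py (values here are illustrative)
POSTS_PER_PROFILE = 12          # how many recent posts to scrape per profile
PROFILES_FILE = "profiles.txt"  # input list, one URL per line
OUTPUT_CSV = "output.csv"
OUTPUT_MD = "output.md"
```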

Testing policy

Pure functions (extract_hashtags, extract_mentions, profile_slug_from_url, read_profiles) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
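
A plausible shape for a test in tests/test_parsers.py (the function body is inlined here so the example is self-contained; the real tests import from scraper):

```python
import re

def extract_hashtags(text):  # inline stand-in for scraper.extract_hashtags
    return re.findall(r"#(\w+)", text or "")

def test_extract_hashtags_basic():
    assert extract_hashtags("golden hour #sunset #nofilter") == ["sunset", "nofilter"]

def test_extract_hashtags_empty():
    assert extract_hashtags("no tags here") == []
```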

Dependencies

  • playwright>=1.40.0 — browser automation
  • pytest>=7.0.0 (dev) — test runner
  • Everything else is standard library: re, csv, sys, pathlib, itertools