# Instagram Scraper — Agent Context

## What this project is
A Playwright-based Instagram scraper. Reads profiles.txt, visits each profile in a real Chromium browser, scrapes the last N posts, and writes output.csv + output.md.
## How to run

```bash
uv run python scraper.py              # run the scraper
uv run pytest tests/ -v               # run unit tests
uv run playwright install chromium    # install browser (first time only)
```

Always use `uv run` — never `python` directly or `pip`.
## Key files

| File | Purpose |
|---|---|
| `scraper.py` | All logic: auth, profile scraping, post scraping, output |
| `profiles.txt` | Input: one Instagram URL per line |
| `auth_state.json` | Saved Playwright session (gitignored, created on first run) |
| `output.csv` | Scraped results (gitignored) |
| `output.md` | Scraped results in Markdown (gitignored) |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |
## Architecture

Single script, no framework. Key functions in `scraper.py`:

- `extract_hashtags(text)` / `extract_mentions(text)` — pure regex, fully tested
- `profile_slug_from_url(url)` — extracts the username from a URL
- `read_profiles()` — reads `profiles.txt`, strips line-number prefixes
- `is_logged_in(page)` / `ensure_authenticated(browser)` — session management
- `get_post_urls(page, profile_url)` — collects post links from the profile grid
- `scrape_post(page, post_url, profile_slug)` — extracts all fields from a post
- `write_csv(posts)` / `write_markdown(posts)` — output writers
- `main()` — orchestrates everything
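The pure helpers above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual code — the real regexes and URL handling in `scraper.py` may differ:

```python
import re
from urllib.parse import urlparse

# Illustrative patterns; the exact regexes in scraper.py may differ.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@([\w.]+)")

def extract_hashtags(text: str) -> list[str]:
    """Return hashtag names (without '#') in order of appearance."""
    return HASHTAG_RE.findall(text)

def extract_mentions(text: str) -> list[str]:
    """Return mentioned usernames (without '@')."""
    return MENTION_RE.findall(text)

def profile_slug_from_url(url: str) -> str:
    """Extract the username (first path segment) from a profile URL."""
    return urlparse(url).path.strip("/").split("/")[0]
```

Because these are pure functions of their string input, they are the part of the scraper that unit tests can cover without a browser.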
## Output fields

`profile`, `post_url`, `date` (ISO 8601), `caption`, `likes`, `image_urls` (comma-separated CDN URLs), `hashtags`, `mentions`, `location`, `media_type` (photo/video/carousel)
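A minimal sketch of a CSV writer for these fields. The field order and `restval` behaviour here are assumptions for illustration; `scraper.py` defines the real schema:

```python
import csv

# Field order assumed for illustration; scraper.py defines the real schema.
FIELDS = ["profile", "post_url", "date", "caption", "likes",
          "image_urls", "hashtags", "mentions", "location", "media_type"]

def write_csv(posts: list[dict], path: str = "output.csv") -> None:
    """Write one row per post dict; any missing keys become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(posts)
```

`csv.DictWriter` with `restval=""` keeps the header stable even when a post is missing optional fields such as `location`.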
## DOM selectors (verified against live Instagram)

Instagram uses atomic CSS — class names change. These structural selectors are stable:

- Post grid links: `a[href*="/p/"]`
- Date: `time[datetime]` → `.get_attribute('datetime')` (first element = post date)
- Caption: JS tree walker over text nodes in the `section` containing a link to the profile slug; filter `len > 20`, skip relative times
- Likes: first `span` with purely numeric text
- Images: `img[src*="cdninstagram"]` filtered by `width > 100` and excluding `/s150x150/`
- Carousel: `button[aria-label="Next"]` exists
- Video: `video` element exists
- Location: `a[href*="/explore/locations/"]` — filter out text "Locations" (footer link)
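The image-filtering rule above can be mirrored in pure Python for clarity. This is a sketch only — in the real scraper the filtering runs as JS inside the page via Playwright, and the function names here are hypothetical:

```python
# Pure-Python mirror of the in-page image filter (sketch; real logic runs
# as JS in the page). Keeps CDN-hosted images that are reasonably large
# and not 150px thumbnails.
def keep_image(src: str, width: int) -> bool:
    return ("cdninstagram" in src
            and width > 100
            and "/s150x150/" not in src)

def filter_images(candidates: list[tuple[str, int]]) -> list[str]:
    """candidates: (src, rendered width) pairs collected from <img> elements."""
    return [src for src, w in candidates if keep_image(src, w)]
```

The `width > 100` check screens out avatars and UI icons; the `/s150x150/` exclusion drops Instagram's square thumbnail variants of the same image.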
## Auth flow

- Check for `auth_state.json` → load with `browser.new_context(storage_state=...)`
- If missing or expired → open a visible browser, wait for the user to log in manually, save with `context.storage_state(path="auth_state.json")`
## Modifying behaviour

- Change the number of posts: edit the `POSTS_PER_PROFILE` constant at the top of `scraper.py`
- Change input/output paths: edit the `PROFILES_FILE`, `OUTPUT_CSV`, `OUTPUT_MD` constants
- Re-authenticate: delete `auth_state.json` and re-run
## Testing policy

Pure functions (`extract_hashtags`, `extract_mentions`, `profile_slug_from_url`, `read_profiles`) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
## Dependencies

- `playwright>=1.40.0` — browser automation
- `pytest>=7.0.0` (dev) — test runner
- Standard library only otherwise: `re`, `csv`, `sys`, `pathlib`, `itertools`