insta_scraper/CLAUDE.md

Instagram Scraper — Agent Context

What this project is

A Playwright-based Instagram scraper. Reads profiles.txt, visits each profile in a real Chromium browser, scrapes the last N posts, and writes output.csv + output.md.

How to run

uv run python scraper.py        # run the scraper
uv run pytest tests/ -v         # run unit tests
uv run playwright install chromium  # install browser (first time only)

Always use uv run — never python directly or pip.

Key files

File                     Purpose
scraper.py               All logic: auth, profile scraping, post scraping, output
profiles.txt             Input: one Instagram URL per line
auth_state.json          Saved Playwright session (gitignored, created on first run)
output.csv               Scraped results (gitignored)
output.md                Scraped results in Markdown (gitignored)
tests/test_parsers.py    Unit tests for pure parsing functions

Architecture

Single script, no framework. Key functions in scraper.py:

  • extract_hashtags(text) / extract_mentions(text) — pure regex, fully tested
  • profile_slug_from_url(url) — extracts username from URL
  • read_profiles() — reads profiles.txt, strips line number prefixes
  • is_logged_in(page) / ensure_authenticated(browser) — session management
  • get_post_urls(page, profile_url) — collects post links from profile grid
  • scrape_post(page, post_url, profile_slug) — extracts all fields from a post
  • write_csv(posts) / write_markdown(posts) — output writers
  • main() — orchestrates everything
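
The pure helpers are small regex functions; a minimal sketch of plausible implementations (the exact regexes in scraper.py may differ):

```python
import re

def extract_hashtags(text: str) -> list[str]:
    """Return hashtag names (without '#') in order of appearance."""
    return re.findall(r"#(\w+)", text or "")

def extract_mentions(text: str) -> list[str]:
    """Return mentioned usernames (without '@'); Instagram handles may contain dots."""
    return re.findall(r"@([\w.]+)", text or "")

def profile_slug_from_url(url: str) -> str:
    """Extract the username segment from an Instagram profile URL."""
    return url.rstrip("/").split("/")[-1].split("?")[0]
```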

Output fields

profile, post_url, date (ISO 8601), caption, likes, image_urls (comma-separated CDN URLs), hashtags, mentions, location, media_type (photo/video/carousel)
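
Viewed as a record, one plausible schema for a scraped post (field names from the list above; the dataclass itself and the types are assumptions, not necessarily how scraper.py represents rows):

```python
from dataclasses import dataclass, asdict

@dataclass
class Post:
    profile: str
    post_url: str
    date: str        # ISO 8601
    caption: str
    likes: int
    image_urls: str  # comma-separated CDN URLs
    hashtags: str
    mentions: str
    location: str
    media_type: str  # photo / video / carousel
```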

DOM selectors (verified against live Instagram)

Instagram uses atomic CSS — class names change. These structural selectors are stable:

  • Post grid links: a[href*="/p/"]
  • Date: time[datetime].get_attribute('datetime') (first element = post date)
  • Caption: JS tree walker over text nodes in section containing a link to the profile slug; filter len > 20, skip relative times
  • Likes: First span with purely numeric text
  • Images: JS img[src*="cdninstagram"] filtered by width > 100 and excluding /s150x150/
  • Carousel: button[aria-label="Next"] exists
  • Video: video element exists
  • Location: a[href*="/explore/locations/"] — filter out text "Locations" (footer link)
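
The image-filter rule above can be expressed as a pure predicate (hypothetical helper name, not necessarily how scraper.py structures it):

```python
def keep_image(src: str, width: int) -> bool:
    """True for full-size CDN images: right host, real width, not a 150px thumbnail."""
    return (
        "cdninstagram" in src
        and width > 100
        and "/s150x150/" not in src
    )
```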

Auth flow

  1. Check for auth_state.json → load with browser.new_context(storage_state=...)
  2. If missing or expired → open visible browser, wait for user to log in manually, save with context.storage_state(path="auth_state.json")
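
A sketch of both steps, assuming the Playwright sync API and a Browser object passed in (the prompt text and exact control flow in scraper.py may differ):

```python
from pathlib import Path

AUTH_STATE = Path("auth_state.json")

def ensure_authenticated(browser):
    """Step 1: reuse the saved session; step 2: fall back to a manual login."""
    if AUTH_STATE.exists():
        return browser.new_context(storage_state=str(AUTH_STATE))
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.instagram.com/")
    input("Log in in the browser window, then press Enter... ")
    context.storage_state(path=str(AUTH_STATE))
    return context
```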

Modifying behaviour

  • Change number of posts: edit POSTS_PER_PROFILE constant at top of scraper.py
  • Change input/output paths: edit PROFILES_FILE, OUTPUT_CSV, OUTPUT_MD constants
  • To re-authenticate: delete auth_state.json and re-run
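
The constants named above sit at the top of scraper.py; the values shown here are illustrative defaults, not what the file actually contains:

```python
# Tunables at the top of scraper.py (values here are illustrative)
POSTS_PER_PROFILE = 12          # how many recent posts to scrape per profile
PROFILES_FILE = "profiles.txt"  # input list, one URL per line
OUTPUT_CSV = "output.csv"
OUTPUT_MD = "output.md"
```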

Testing policy

Pure functions (extract_hashtags, extract_mentions, profile_slug_from_url, read_profiles) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
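
A plausible shape for a test in tests/test_parsers.py (the function body is inlined here so the example is self-contained; the real tests import from scraper):

```python
import re

def extract_hashtags(text):  # inline stand-in for scraper.extract_hashtags
    return re.findall(r"#(\w+)", text or "")

def test_extract_hashtags_basic():
    assert extract_hashtags("golden hour #sunset #nofilter") == ["sunset", "nofilter"]

def test_extract_hashtags_empty():
    assert extract_hashtags("no tags here") == []
```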

Dependencies

  • playwright>=1.40.0 — browser automation
  • pytest>=7.0.0 (dev) — test runner
  • Everything else is standard library: re, csv, sys, pathlib, itertools