# Instagram Scraper — Agent Context
## What this project is
A Playwright-based Instagram scraper. It reads `profiles.txt`, visits each profile in a real Chromium browser, scrapes the last N posts (N is the `POSTS_PER_PROFILE` constant), and writes `output.csv` and `output.md`.
## How to run
```bash
uv run python scraper.py # run the scraper
uv run pytest tests/ -v # run unit tests
uv run playwright install chromium # install browser (first time only)
```
Always use `uv run` — never `python` directly or `pip`.
## Key files
| File | Purpose |
|---|---|
| `scraper.py` | All logic: auth, profile scraping, post scraping, output |
| `profiles.txt` | Input: one Instagram URL per line |
| `auth_state.json` | Saved Playwright session (gitignored, created on first run) |
| `output.csv` | Scraped results (gitignored) |
| `output.md` | Scraped results in Markdown (gitignored) |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |
## Architecture
Single script, no framework. Key functions in `scraper.py`:
- `extract_hashtags(text)` / `extract_mentions(text)` — pure regex, fully tested
- `profile_slug_from_url(url)` — extracts username from URL
- `read_profiles()` — reads `profiles.txt`, strips line number prefixes
- `is_logged_in(page)` / `ensure_authenticated(browser)` — session management
- `get_post_urls(page, profile_url)` — collects post links from profile grid
- `scrape_post(page, post_url, profile_slug)` — extracts all fields from a post
- `write_csv(posts)` / `write_markdown(posts)` — output writers
- `main()` — orchestrates everything
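
The pure parsing helpers listed above could look roughly like the following. This is a minimal sketch, not the code in `scraper.py`: the regular expressions, the choice to return tags and usernames without their `#` or `@` prefix, and the use of `urlparse` are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from urllib.parse import urlparse

# Assumed patterns; the real project may tokenise captions differently.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<![\w.])@([A-Za-z0-9_.]+)")


def extract_hashtags(text: str) -> list[str]:
    """Return hashtags in order of appearance, without the leading '#'."""
    return HASHTAG_RE.findall(text)


def extract_mentions(text: str) -> list[str]:
    """Return mentioned usernames in order, without the leading '@'."""
    # Usernames may contain dots, so drop a trailing sentence full stop.
    return [name.rstrip(".") for name in MENTION_RE.findall(text)]


def profile_slug_from_url(url: str) -> str:
    """Take the first path segment of a profile URL as the username."""
    path = urlparse(url).path
    return path.strip("/").split("/")[0]
```

The lookbehinds keep e-mail addresses and URL fragments from being read as mentions or hashtags, which is one plausible reason to keep these functions pure and unit-tested.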
## Output fields
`profile`, `post_url`, `date` (ISO 8601), `caption`, `likes`, `image_urls` (comma-separated CDN URLs), `hashtags`, `mentions`, `location`, `media_type` (photo/video/carousel)
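
As an illustration of how these fields might map onto CSV columns, here is a small sketch built on `csv.DictWriter`. The function name `rows_to_csv`, the choice to return text instead of writing a file, and the comma-joining of `hashtags` and `mentions` are assumptions; only the comma-separated `image_urls` format is stated above.

```python
from __future__ import annotations

import csv
import io

# Column order taken from the field list above.
FIELDNAMES = [
    "profile", "post_url", "date", "caption", "likes",
    "image_urls", "hashtags", "mentions", "location", "media_type",
]


def rows_to_csv(posts: list[dict]) -> str:
    """Serialise post dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for post in posts:
        row = dict(post)  # copy so the caller's dict is not modified
        # List-valued fields are flattened into one comma-separated cell.
        for key in ("image_urls", "hashtags", "mentions"):
            if isinstance(row.get(key), list):
                row[key] = ",".join(row[key])
        writer.writerow(row)
    return buffer.getvalue()
```

Because the `csv` module quotes cells that contain commas, a comma-separated `image_urls` value still round-trips as a single column.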
## DOM selectors (verified against live Instagram)
Instagram uses atomic CSS — class names change. These structural selectors are stable:
- **Post grid links**: `a[href*="/p/"]`
- **Date**: `time[datetime]` → `.get_attribute('datetime')` (first element = post date)
- **Caption**: JS tree walker over text nodes in the `section` that contains a link to the profile slug; keep text longer than 20 characters and skip relative timestamps
- **Likes**: First `span` with purely numeric text
- **Images**: JS `img[src*="cdninstagram"]` filtered by `width > 100` and excluding `/s150x150/`
- **Carousel**: `button[aria-label="Next"]` exists
- **Video**: `video` element exists
- **Location**: `a[href*="/explore/locations/"]` — filter out text "Locations" (footer link)
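
The filtering rules in the list above can be expressed as pure functions over values already pulled from the page, which keeps them separate from the browser calls. The sketch below is hypothetical: the function names, the input shapes (lists of strings, dicts with `src` and `width`), and the precedence of carousel over video are assumptions, not the project's actual code.

```python
from __future__ import annotations

import re


def pick_likes(span_texts: list[str]) -> int | None:
    """Return the first span text that is purely numeric, as an int."""
    for text in span_texts:
        cleaned = text.strip()
        if re.fullmatch(r"[0-9]+", cleaned):
            return int(cleaned)
    return None


def filter_image_urls(images: list[dict]) -> list[str]:
    """Keep CDN images wider than 100 px that are not 150x150 thumbnails."""
    return [
        img["src"]
        for img in images
        if "cdninstagram" in img["src"]
        and img["width"] > 100
        and "/s150x150/" not in img["src"]
    ]


def pick_location(link_texts: list[str]) -> str:
    """Return the first location link text that is not the footer link."""
    for text in link_texts:
        cleaned = text.strip()
        if cleaned and cleaned != "Locations":
            return cleaned
    return ""


def classify_media(has_next_button: bool, has_video: bool) -> str:
    """Map the two structural checks to a media_type value."""
    # Assumed precedence: a carousel may contain videos, so check it first.
    if has_next_button:
        return "carousel"
    if has_video:
        return "video"
    return "photo"
```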
## Auth flow
1. Check for `auth_state.json` → load with `browser.new_context(storage_state=...)`
2. If missing or expired → open visible browser, wait for user to log in manually, save with `context.storage_state(path="auth_state.json")`
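
The two steps above might be wired together as follows. This is a sketch under assumptions, not the project's implementation: the `sessionid` cookie check in `is_logged_in`, the `state_path` and `wait_for_login` parameters, and the two URL constants are illustrative. The Playwright calls used (`new_context`, `new_page`, `goto`, `cookies`, `storage_state`, `close`) are part of its sync API, and `browser` is expected to be a headed browser launched by the caller.

```python
from pathlib import Path

AUTH_STATE = Path("auth_state.json")
HOME_URL = "https://www.instagram.com/"
LOGIN_URL = "https://www.instagram.com/accounts/login/"


def is_logged_in(page) -> bool:
    """Assumed heuristic: a session cookie named 'sessionid' is present."""
    return any(c["name"] == "sessionid" for c in page.context.cookies())


def ensure_authenticated(browser, state_path=AUTH_STATE, wait_for_login=input):
    """Return a logged-in context, reusing the saved session when it is valid."""
    if state_path.exists():
        # Step 1: try the saved session first.
        context = browser.new_context(storage_state=str(state_path))
        page = context.new_page()
        page.goto(HOME_URL)
        logged_in = is_logged_in(page)
        page.close()
        if logged_in:
            return context
        context.close()  # saved session no longer works; log in again

    # Step 2: manual login in the visible browser, then persist the session.
    context = browser.new_context()
    page = context.new_page()
    page.goto(LOGIN_URL)
    wait_for_login("Log in in the browser window, then press Enter here: ")
    context.storage_state(path=str(state_path))
    page.close()
    return context
```

A caller would typically obtain `browser` from `sync_playwright()` with `chromium.launch(headless=False)` so that the manual login window is visible.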
## Modifying behaviour
- Change number of posts: edit `POSTS_PER_PROFILE` constant at top of `scraper.py`
- Change input/output paths: edit `PROFILES_FILE`, `OUTPUT_CSV`, `OUTPUT_MD` constants
- To re-authenticate: delete `auth_state.json` and re-run
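
For orientation, the constants named above might sit at the top of `scraper.py` along these lines. The value `5` is a placeholder (the actual default is not stated here), the use of `Path` objects is an assumption, and `limit_posts` is a hypothetical helper showing one way `itertools.islice` could apply the limit.

```python
from itertools import islice
from pathlib import Path

POSTS_PER_PROFILE = 5  # placeholder value; the real default is not stated here
PROFILES_FILE = Path("profiles.txt")
OUTPUT_CSV = Path("output.csv")
OUTPUT_MD = Path("output.md")


def limit_posts(post_urls, limit=POSTS_PER_PROFILE):
    """Keep only the first `limit` post URLs, in the order they were found."""
    return list(islice(post_urls, limit))
```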
## Testing policy
Pure functions (`extract_hashtags`, `extract_mentions`, `profile_slug_from_url`, `read_profiles`) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
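
A unit test for one of these functions might look like the sketch below. The `read_profiles` shown here is a stand-in written for the example: it takes a `path` argument, and the line-number prefix format it strips (digits followed by optional `.`, `)` or `:` and whitespace) is an assumption about what "line number prefixes" means in this project.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

# Assumed prefix shape, e.g. "3. " or "12: "; the real format may differ.
LINE_NUMBER_PREFIX = re.compile(r"^\s*\d+[.):]?\s+")


def read_profiles(path: Path) -> list[str]:
    """Stand-in for the real function: one URL per line, prefixes stripped."""
    urls = []
    for line in path.read_text(encoding="utf-8").splitlines():
        cleaned = LINE_NUMBER_PREFIX.sub("", line).strip()
        if cleaned:  # skip blank lines
            urls.append(cleaned)
    return urls


def test_read_profiles_strips_line_numbers():
    with tempfile.TemporaryDirectory() as tmp:
        profiles = Path(tmp) / "profiles.txt"
        profiles.write_text(
            "1. https://www.instagram.com/example_one/\n"
            "\n"
            "https://www.instagram.com/example_two/\n",
            encoding="utf-8",
        )
        assert read_profiles(profiles) == [
            "https://www.instagram.com/example_one/",
            "https://www.instagram.com/example_two/",
        ]
```

The test touches only the file system and never starts a browser, which matches the policy of keeping browser functions out of unit tests.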
## Dependencies
- `playwright>=1.40.0` — browser automation
- `pytest>=7.0.0` (dev) — test runner
- Everything else is standard library: `re`, `csv`, `sys`, `pathlib`, `itertools`