From 8db07570b00766d5680577c9ef00e6b9b6fb5ef7 Mon Sep 17 00:00:00 2001
From: belisards
Date: Sun, 29 Mar 2026 17:13:24 -0300
Subject: [PATCH] docs: add README, CLAUDE.md, and anonymised profiles.txt for
 public release

---
 CLAUDE.md    |  77 ++++++++++++++++++++++++++++++++++++
 README.md    | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++
 profiles.txt |   8 ++--
 3 files changed, 190 insertions(+), 5 deletions(-)
 create mode 100644 CLAUDE.md
 create mode 100644 README.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..ef6984a
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,77 @@
+# Instagram Scraper — Agent Context
+
+## What this project is
+
+A Playwright-based Instagram scraper. Reads `profiles.txt`, visits each profile in a real Chromium browser, scrapes the last N posts, and writes `output.csv` + `output.md`.
+
+## How to run
+
+```bash
+uv run python scraper.py            # run the scraper
+uv run pytest tests/ -v             # run unit tests
+uv run playwright install chromium  # install browser (first time only)
+```
+
+Always use `uv run` — never `python` directly or `pip`.
+
+## Key files
+
+| File | Purpose |
+|---|---|
+| `scraper.py` | All logic: auth, profile scraping, post scraping, output |
+| `profiles.txt` | Input: one Instagram URL per line |
+| `auth_state.json` | Saved Playwright session (gitignored, created on first run) |
+| `output.csv` | Scraped results (gitignored) |
+| `output.md` | Scraped results in Markdown (gitignored) |
+| `tests/test_parsers.py` | Unit tests for pure parsing functions |
+
+## Architecture
+
+Single script, no framework. Key functions in `scraper.py`:
+
+- `extract_hashtags(text)` / `extract_mentions(text)` — pure regex, fully tested
+- `profile_slug_from_url(url)` — extracts username from URL
+- `read_profiles()` — reads `profiles.txt`, strips line number prefixes
+- `is_logged_in(page)` / `ensure_authenticated(browser)` — session management
+- `get_post_urls(page, profile_url)` — collects post links from profile grid
+- `scrape_post(page, post_url, profile_slug)` — extracts all fields from a post
+- `write_csv(posts)` / `write_markdown(posts)` — output writers
+- `main()` — orchestrates everything
+
+## Output fields
+
+`profile`, `post_url`, `date` (ISO 8601), `caption`, `likes`, `image_urls` (comma-separated CDN URLs), `hashtags`, `mentions`, `location`, `media_type` (photo/video/carousel)
+
+## DOM selectors (verified against live Instagram)
+
+Instagram uses atomic CSS — class names change. These structural selectors are stable:
+
+- **Post grid links**: `a[href*="/p/"]`
+- **Date**: `time[datetime]` → `.get_attribute('datetime')` (first element = post date)
+- **Caption**: JS tree walker over text nodes in `section` containing a link to the profile slug; filter `len > 20`, skip relative times
+- **Likes**: First `span` with purely numeric text
+- **Images**: JS `img[src*="cdninstagram"]` filtered by `width > 100` and excluding `/s150x150/`
+- **Carousel**: `button[aria-label="Next"]` exists
+- **Video**: `video` element exists
+- **Location**: `a[href*="/explore/locations/"]` — filter out text "Locations" (footer link)
+
+## Auth flow
+
+1. Check for `auth_state.json` → load with `browser.new_context(storage_state=...)`
+2. If missing or expired → open visible browser, wait for user to log in manually, save with `context.storage_state(path="auth_state.json")`
+
+## Modifying behaviour
+
+- Change number of posts: edit `POSTS_PER_PROFILE` constant at top of `scraper.py`
+- Change input/output paths: edit `PROFILES_FILE`, `OUTPUT_CSV`, `OUTPUT_MD` constants
+- To re-authenticate: delete `auth_state.json` and re-run
+
+## Testing policy
+
+Pure functions (`extract_hashtags`, `extract_mentions`, `profile_slug_from_url`, `read_profiles`) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
+
+## Dependencies
+
+- `playwright>=1.40.0` — browser automation
+- `pytest>=7.0.0` (dev) — test runner
+- Standard library only: `re`, `csv`, `sys`, `pathlib`, `itertools`
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..03ceb31
--- /dev/null
+++ b/README.md
@@ -0,0 +1,110 @@
+# Instagram Profile Scraper
+
+Scrapes the last N posts from a list of Instagram profiles and saves the results as CSV and Markdown. Uses Playwright with a real authenticated browser session — no API keys required.
+
+## What it does
+
+1. Reads a list of Instagram profile URLs from `profiles.txt`
+2. Opens a Chromium browser (visible, non-headless)
+3. On first run: prompts you to log in manually, then saves the session to `auth_state.json`
+4. On subsequent runs: reuses the saved session automatically
+5. Visits each profile, collects the last 5 post URLs
+6. Visits each post and extracts: date, caption, likes, image URLs, hashtags, mentions, location, media type
+7. Writes combined results to `output.csv` and `output.md`
+
+## Setup
+
+Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).
+
+```bash
+git clone <repo-url>
+cd instagram-scraper
+uv sync
+uv run playwright install chromium
+```
+
+## Configuration
+
+**`profiles.txt`** — one Instagram profile URL per line. Lines can optionally be prefixed with a number and tab (the scraper strips them):
+
+```
+https://www.instagram.com/username1/
+https://www.instagram.com/username2/
+https://www.instagram.com/username3/
+```
+
+**Constants in `scraper.py`** (edit directly):
+
+| Constant | Default | Description |
+|---|---|---|
+| `POSTS_PER_PROFILE` | `5` | How many posts to scrape per profile |
+| `PROFILES_FILE` | `profiles.txt` | Input file path |
+| `OUTPUT_CSV` | `output.csv` | CSV output path |
+| `OUTPUT_MD` | `output.md` | Markdown output path |
+
+## Usage
+
+```bash
+uv run python scraper.py
+```
+
+On first run, a Chromium window opens. Log in to Instagram, then press Enter in the terminal. The session is saved to `auth_state.json` and reused on future runs.
+
+If Instagram logs you out, delete `auth_state.json` and run again.
+
+## Output
+
+### CSV (`output.csv`)
+
+One row per post with these columns:
+
+| Column | Description |
+|---|---|
+| `profile` | Instagram username |
+| `post_url` | Full URL of the post |
+| `date` | ISO 8601 datetime (e.g. `2026-02-23T15:28:13.000Z`) |
+| `caption` | Full post caption text |
+| `likes` | Like count (as displayed) |
+| `image_urls` | Comma-separated CDN image URLs |
+| `hashtags` | Comma-separated hashtags from caption |
+| `mentions` | Comma-separated @mentions from caption |
+| `location` | Location tag text (empty if none) |
+| `media_type` | `photo`, `video`, or `carousel` |
+
+### Markdown (`output.md`)
+
+Same data, grouped by profile. Each post is a section with all fields as a bullet list.
+
+## Error handling
+
+- Private or unavailable profiles: skipped with a warning, scraping continues
+- Individual post failures: skipped with a warning, scraping continues
+- Missing fields: stored as empty string, no crash
+- Keyboard interrupt (Ctrl+C): saves whatever has been collected so far
+
+## Files
+
+```
+instagram-scraper/
+├── scraper.py            # main script
+├── profiles.txt          # input: list of profile URLs
+├── pyproject.toml        # project metadata and dependencies
+├── tests/
+│   └── test_parsers.py   # unit tests for parsing functions
+├── auth_state.json       # saved session (created on first run, gitignored)
+├── output.csv            # results (gitignored)
+└── output.md             # results (gitignored)
+```
+
+## Running tests
+
+```bash
+uv run pytest tests/ -v
+```
+
+## Notes
+
+- Works with public profiles. Private profiles are skipped.
+- Instagram rate-limits aggressive scraping. The script adds a 1.5s wait between post requests.
+- Session cookies expire periodically. Delete `auth_state.json` to re-authenticate.
+- Image URLs are CDN URLs that expire after some time — download them promptly if needed.
diff --git a/profiles.txt b/profiles.txt
index 7cf7006..f162e16 100644
--- a/profiles.txt
+++ b/profiles.txt
@@ -1,5 +1,3 @@
-https://www.instagram.com/licmuunisul/
-https://www.instagram.com/ligamiunisulpb/
-https://www.instagram.com/lipracunisul/
-https://www.instagram.com/liaphunisul/
-https://www.instagram.com/lipali.unisul/
+https://www.instagram.com/username1/
+https://www.instagram.com/username2/
+https://www.instagram.com/username3/
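
For reference, CLAUDE.md above describes `extract_hashtags` and `extract_mentions` as pure regex helpers. A minimal sketch of what they might look like, assuming simple `#word` / `@name` patterns (the actual patterns in `scraper.py` may differ):

```python
import re

# Hypothetical patterns; the shipped scraper.py may use stricter ones.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@([A-Za-z0-9_.]+)")

def extract_hashtags(text: str) -> list[str]:
    """Return hashtag names (without the leading '#') found in a caption."""
    return HASHTAG_RE.findall(text)

def extract_mentions(text: str) -> list[str]:
    """Return mentioned usernames (without the leading '@') found in a caption."""
    return MENTION_RE.findall(text)

print(extract_hashtags("Sunset walk #beach #golden_hour with @friend.one"))  # ['beach', 'golden_hour']
print(extract_mentions("Sunset walk #beach #golden_hour with @friend.one"))  # ['friend.one']
```

Note that `@([A-Za-z0-9_.]+)` would also match the domain part of an e-mail address; a production pattern may need a guard for that edge case.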
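
Similarly, the input helpers listed in CLAUDE.md (`read_profiles`, which strips the optional number-and-tab prefixes the README mentions, and `profile_slug_from_url`) might be sketched as below. Treat the function bodies as assumptions about the documented behaviour, not as the code in `scraper.py`:

```python
import re
from pathlib import Path

# Hypothetical prefix pattern, e.g. "12<TAB>https://..." -> "https://..."
LINE_PREFIX_RE = re.compile(r"^\d+\t")

def read_profiles(path: str = "profiles.txt") -> list[str]:
    """Read one profile URL per line, skipping blank lines and stripping
    an optional leading line-number-plus-tab prefix."""
    urls = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = LINE_PREFIX_RE.sub("", raw.strip()).strip()
        if line:
            urls.append(line)
    return urls

def profile_slug_from_url(url: str) -> str:
    """Extract the username from a profile URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]

print(profile_slug_from_url("https://www.instagram.com/username1/"))  # username1
```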