# Instagram Profile Scraper

Scrapes the last N posts (default 5, set via `POSTS_PER_PROFILE`) from a list of Instagram profiles and saves the results as CSV and Markdown. Uses Playwright with a real authenticated browser session — no API keys required.

## What it does

1. Reads a list of Instagram profile URLs from `profiles.txt`
2. Opens a Chromium browser (visible, non-headless)
3. On first run: prompts you to log in manually, then saves the session to `auth_state.json`
4. On subsequent runs: reuses the saved session automatically
5. Visits each profile and collects the last 5 post URLs (configurable via `POSTS_PER_PROFILE`)
6. Visits each post and extracts: date, caption, likes, image URLs, hashtags, mentions, location, media type
7. Writes combined results to `output.csv` and `output.md`
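
The hashtag and mention extraction in step 6 comes down to simple regexes over the caption text. A minimal sketch (the function names are illustrative, not necessarily those in `scraper.py`):

```python
import re

def extract_hashtags(caption: str) -> list[str]:
    # Hashtags: '#' followed by word characters (letters, digits, underscore)
    return re.findall(r"#(\w+)", caption)

def extract_mentions(caption: str) -> list[str]:
    # Mentions: '@' followed by Instagram-style username characters (word chars and dots)
    return re.findall(r"@([\w.]+)", caption)
```

The comma-separated `hashtags` and `mentions` CSV columns would then just be `",".join(...)` over these lists.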
## Setup

Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).

```bash
git clone <repo-url>
cd instagram-scraper
uv sync
uv run playwright install chromium
```

## Configuration

**`profiles.txt`** — one Instagram profile URL per line. Lines can optionally be prefixed with a number and tab (the scraper strips them):

```
https://www.instagram.com/username1/
https://www.instagram.com/username2/
https://www.instagram.com/username3/
```
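
The prefix stripping described above might look like the following (a sketch; `load_profiles` is a hypothetical name, and the real parser in `scraper.py` may differ):

```python
import re

def load_profiles(text: str) -> list[str]:
    """Return profile URLs, dropping optional '<number><TAB>' prefixes and blank lines."""
    urls = []
    for line in text.splitlines():
        # Strip an optional leading number-and-tab prefix, then surrounding whitespace
        cleaned = re.sub(r"^\d+\t", "", line).strip()
        if cleaned:
            urls.append(cleaned)
    return urls
```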
**Constants in `scraper.py`** (edit directly):

| Constant | Default | Description |
|---|---|---|
| `POSTS_PER_PROFILE` | `5` | How many posts to scrape per profile |
| `PROFILES_FILE` | `profiles.txt` | Input file path |
| `OUTPUT_CSV` | `output.csv` | CSV output path |
| `OUTPUT_MD` | `output.md` | Markdown output path |

## Usage

```bash
uv run python scraper.py
```

On first run, a Chromium window opens. Log in to Instagram, then press Enter in the terminal. The session is saved to `auth_state.json` and reused on future runs.

If Instagram logs you out, delete `auth_state.json` and run again.
## Output

### CSV (`output.csv`)

One row per post with these columns:

| Column | Description |
|---|---|
| `profile` | Instagram username |
| `post_url` | Full URL of the post |
| `date` | ISO 8601 datetime (e.g. `2026-02-23T15:28:13.000Z`) |
| `caption` | Full post caption text |
| `likes` | Like count (as displayed) |
| `image_urls` | Comma-separated CDN image URLs |
| `hashtags` | Comma-separated hashtags from caption |
| `mentions` | Comma-separated @mentions from caption |
| `location` | Location tag text (empty if none) |
| `media_type` | `photo`, `video`, or `carousel` |
### Markdown (`output.md`)

Same data, grouped by profile. Each post is a section with all fields as a bullet list.
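
Since rows already arrive ordered by profile, the grouping falls out of `itertools.groupby`. A minimal sketch (heading levels and field order are assumptions about the actual `output.md` layout):

```python
from itertools import groupby

def render_markdown(rows):
    # Rows are ordered by profile, so groupby collects each profile's adjacent posts
    lines = []
    for profile, posts in groupby(rows, key=lambda r: r["profile"]):
        lines.append(f"## {profile}")
        for post in posts:
            lines.append(f"### {post['post_url']}")
            for field, value in post.items():
                if field not in ("profile", "post_url"):
                    lines.append(f"- **{field}**: {value}")
            lines.append("")
    return "\n".join(lines)
```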
## Error handling

- Private or unavailable profiles: skipped with a warning, scraping continues
- Individual post failures: skipped with a warning, scraping continues
- Missing fields: stored as empty string, no crash
- Keyboard interrupt (Ctrl+C): saves whatever has been collected so far
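
That skip-and-continue policy boils down to a driver loop shaped roughly like this (a sketch; `scrape_profile` and `save_results` are hypothetical stand-ins for the real functions in `scraper.py`):

```python
def scrape_all(profile_urls, scrape_profile, save_results):
    results = []
    try:
        for url in profile_urls:
            try:
                results.extend(scrape_profile(url))
            except Exception as exc:
                # Private/unavailable profile or failed post: warn and move on
                print(f"warning: skipping {url}: {exc}")
    except KeyboardInterrupt:
        # Ctrl+C: fall through and save whatever was collected so far
        print("interrupted, saving partial results")
    save_results(results)
    return results
```

The key design point is that `save_results` runs on every exit path, so a Ctrl+C mid-run still produces a usable `output.csv`.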
## Files

```
instagram-scraper/
├── scraper.py           # main script
├── profiles.txt         # input: list of profile URLs
├── pyproject.toml       # project metadata and dependencies
├── tests/
│   └── test_parsers.py  # unit tests for parsing functions
├── auth_state.json      # saved session (created on first run, gitignored)
├── output.csv           # results (gitignored)
└── output.md            # results (gitignored)
```

## Running tests

```bash
uv run pytest tests/ -v
```

## Notes

- Works with public profiles. Private profiles are skipped.
- Instagram rate-limits aggressive scraping. The script adds a 1.5s wait between post requests.
- Session cookies expire periodically. Delete `auth_state.json` to re-authenticate.
- Image URLs are CDN URLs that expire after some time — download them promptly if needed.