# Instagram Scraper — Design Spec

**Date:** 2026-03-29

## Overview

A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.

## Architecture

Single script, no external services. A persistent browser profile (`./browser_profile/`) stores cookies and session state, so authentication only needs to happen once.

## Flow

1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect the Instagram home feed); if not, pause and wait for the user to log in manually, then press Enter to continue
3. Read `profiles.txt` and extract profile URLs (skip blank lines and line numbers)
4. For each profile:
   - Navigate to the profile page
   - Wait for the posts grid to load
   - Collect the first 5 post links from the grid
   - Visit each post page
   - Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`

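Step 3 above can be sketched as a pair of small pure helpers. Normalizing bare usernames to full profile URLs is an assumption (the spec only promises URLs in `profiles.txt`); `parse_profiles` and `username_from_url` are illustrative names, not a fixed API:

```python
import re


def parse_profiles(text: str) -> list[str]:
    """Extract profile URLs from profiles.txt content.

    Tolerates blank lines and optional leading line numbers
    (e.g. "1. https://..." or "2) https://...").
    """
    urls = []
    for line in text.splitlines():
        # Drop leading line numbers like "1." or "2)" plus surrounding whitespace
        cleaned = re.sub(r"^\s*\d+[.)]?\s*", "", line).strip()
        if not cleaned:
            continue
        # Assumption: bare usernames are normalized to full profile URLs
        if not cleaned.startswith("http"):
            cleaned = f"https://www.instagram.com/{cleaned.strip('/')}/"
        urls.append(cleaned)
    return urls


def username_from_url(url: str) -> str:
    # The username is the first path segment of the profile URL
    return url.split("instagram.com/")[-1].strip("/").split("/")[0]
```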
## Data Fields

| Field | Description |
|---|---|
| `profile` | Profile username (from `profiles.txt`) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from the `<time>` element's `datetime` attribute |
| `caption` | Full caption text (expanding the "more" button if present) |
| `likes` | Like count (integer, or the raw string if not parseable) |
| `image_urls` | Comma-separated list of image/video `src` URLs in the post |
| `hashtags` | Hashtags extracted from the caption (comma-separated) |
| `mentions` | @mentions extracted from the caption (comma-separated) |
| `location` | Location tag text if present, else empty |
| `media_type` | One of: `photo`, `video`, `carousel` |

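The caption-derived fields can be extracted with the standard `re` module. The patterns below are simplifying assumptions (Instagram also allows some Unicode characters in hashtags); the likes parser falls back to the raw string per the table above:

```python
import re

# Simplified patterns: word characters after '#', word characters and dots after '@'
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@([\w.]+)")


def extract_hashtags(caption: str) -> str:
    """Comma-separated hashtags found in the caption."""
    return ",".join(HASHTAG_RE.findall(caption))


def extract_mentions(caption: str) -> str:
    """Comma-separated @mentions found in the caption."""
    return ",".join(MENTION_RE.findall(caption))


def parse_likes(raw: str):
    """Return the like count as an int, or the raw string if not parseable."""
    digits = raw.replace(",", "").strip()
    try:
        return int(digits)
    except ValueError:
        return raw
```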
## Output Format

**CSV (`output.csv`):** One row per post, with columns matching the data fields above.

**Markdown (`output.md`):** Grouped by profile. Each post is a second-level section with all fields as a bullet list.

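A minimal sketch of both writers, rendering to strings so the script can write them out to `output.csv`/`output.md`; `FIELDS` and the grouping helper are illustrative, not a fixed API:

```python
import csv
import io

FIELDS = ["profile", "post_url", "date", "caption", "likes",
          "image_urls", "hashtags", "mentions", "location", "media_type"]


def write_csv(posts: list[dict]) -> str:
    """Render posts as CSV text: one row per post, columns = data fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
    return buf.getvalue()


def write_markdown(posts: list[dict]) -> str:
    """Render posts grouped by profile; each post is a second-level section."""
    by_profile: dict[str, list[dict]] = {}
    for post in posts:
        by_profile.setdefault(post["profile"], []).append(post)
    lines = []
    for profile, group in by_profile.items():
        lines.append(f"# {profile}")
        for post in group:
            lines.append(f"## {post['post_url']}")
            for field in FIELDS:
                lines.append(f"- **{field}:** {post.get(field, '')}")
            lines.append("")  # blank line between posts
    return "\n".join(lines)
```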
## Error Handling

- If a profile page fails to load (404, private account, etc.): log a warning, skip that profile, and continue
- If a post page fails to load: log a warning and skip that post
- If a field cannot be found: store an empty string (no crash)
- On keyboard interrupt: flush any collected data to the output files before exiting

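The flush-on-interrupt behavior can be sketched with `try`/`finally`: `KeyboardInterrupt` is not an `Exception` subclass, so the per-profile handler lets it propagate while `finally` still flushes. `scrape_profile` and `flush_outputs` are placeholder callables, not part of the spec:

```python
def scrape_all(profiles, scrape_profile, flush_outputs):
    """Scrape each profile, always flushing collected rows on exit.

    scrape_profile(profile) returns a list of post dicts (or raises);
    flush_outputs(rows) writes whatever was collected to output.csv/output.md.
    """
    collected = []
    try:
        for profile in profiles:
            try:
                collected.extend(scrape_profile(profile))
            except Exception as exc:  # 404, private, timeout: warn and continue
                print(f"warning: skipping {profile}: {exc}")
    finally:
        # Runs on normal exit AND on KeyboardInterrupt (Ctrl-C)
        flush_outputs(collected)
    return collected
```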
## Dependencies

- `playwright` (Python) — browser automation
- Standard library only otherwise: `csv` for CSV output, string formatting for Markdown, and `re` for hashtag/mention extraction

## Files

```
instagram/
├── profiles.txt       # input: list of profile URLs
├── scraper.py         # main script
├── browser_profile/   # persistent Chromium session (gitignored)
├── output.csv         # combined output
└── output.md          # combined output
```

## Session Persistence

The `./browser_profile/` directory holds the Chromium user data. Once authenticated, subsequent runs reuse the session without prompting. If Instagram logs the session out, the script detects this and prompts for re-authentication.
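
A sketch of launching the persistent context and detecting a logged-out session, assuming Instagram redirects unauthenticated visitors to `/accounts/login` (a heuristic, not verified against current Instagram behavior):

```python
def is_authenticated(current_url: str) -> bool:
    """Heuristic: logged-out visitors are redirected to the login page."""
    return "/accounts/login" not in current_url


def open_session(profile_dir: str = "./browser_profile"):
    """Launch Chromium with a persistent profile; prompt for manual login if needed."""
    # Import deferred so the pure helper above stays importable without Playwright
    from playwright.sync_api import sync_playwright

    pw = sync_playwright().start()
    context = pw.chromium.launch_persistent_context(profile_dir, headless=False)
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://www.instagram.com/")
    if not is_authenticated(page.url):
        input("Log in to Instagram in the browser window, then press Enter...")
    return pw, context, page
```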