chore: init project with playwright dependency

This commit is contained in:
belisards
2026-03-29 16:50:18 -03:00
commit c5a01190c1
5 changed files with 616 additions and 0 deletions

# Instagram Scraper — Design Spec
**Date:** 2026-03-29
## Overview
A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.
## Architecture
Single script, no external services. Persistent browser profile (`./browser_profile/`) stores cookies/session so authentication only needs to happen once.
## Flow
1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect Instagram home feed); if not, pause and wait for user to log in manually, then press Enter to continue
3. Read `profiles.txt`, extract profile URLs (skip blank lines and line numbers)
4. For each profile:
- Navigate to the profile page
- Wait for the posts grid to load
- Collect the first 5 post links from the grid
- Visit each post page
- Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`
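Step 3 above can be sketched with the standard library alone. The helper name and the exact line-number format it strips are assumptions for illustration:

```python
import re

def read_profiles(text: str) -> list[str]:
    """Extract Instagram usernames from profiles.txt content.

    Skips blank lines, drops leading line numbers (e.g. "3. https://..."),
    and accepts either full profile URLs or bare usernames.
    """
    profiles = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop a leading line number like "12." or "12)"
        line = re.sub(r"^\d+[.)]\s*", "", line)
        # Pull the username out of a full profile URL, else use as-is
        m = re.match(r"https?://(?:www\.)?instagram\.com/([^/?#]+)", line)
        profiles.append(m.group(1) if m else line)
    return profiles
```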
## Data Fields
| Field | Description |
|---|---|
| `profile` | Profile username (from profiles.txt) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from `<time>` element's `datetime` attribute |
| `caption` | Full caption text (expand "more" button if present) |
| `likes` | Like count (integer or raw string if not parseable) |
| `image_urls` | Comma-separated list of image/video src URLs in the post |
| `hashtags` | Hashtags extracted from caption (comma-separated) |
| `mentions` | @mentions extracted from caption (comma-separated) |
| `location` | Location tag text if present, else empty |
| `media_type` | One of: `photo`, `video`, `carousel` |
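The `hashtags` and `mentions` fields can be derived from the caption with two regexes; the patterns below are a minimal sketch (they ignore Unicode edge cases in tag names):

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@([\w.]+)")

def extract_tags(caption: str) -> tuple[str, str]:
    """Return (hashtags, mentions) as comma-separated strings."""
    hashtags = ",".join(HASHTAG_RE.findall(caption))
    # Strip a trailing "." so end-of-sentence mentions parse cleanly
    mentions = ",".join(m.rstrip(".") for m in MENTION_RE.findall(caption))
    return hashtags, mentions
```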
## Output Format
**CSV (`output.csv`):** One row per post, columns matching data fields above.
**Markdown (`output.md`):** Grouped by profile. Each post is a second-level section with all fields as a bullet list.
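A minimal sketch of the Markdown writer described above, assuming posts arrive as dicts keyed by the data fields (the function name and the `#`-per-profile / `##`-per-post layout are illustrative choices):

```python
from collections import defaultdict

def write_markdown(posts: list[dict], path: str) -> None:
    """Write output.md: posts grouped by profile, one second-level
    section per post, all fields as a bullet list."""
    by_profile: dict[str, list[dict]] = defaultdict(list)
    for post in posts:
        by_profile[post["profile"]].append(post)
    lines = []
    for profile, items in by_profile.items():
        lines.append(f"# {profile}")
        for post in items:
            lines.append(f"\n## {post['post_url']}")
            for field, value in post.items():
                lines.append(f"- **{field}:** {value}")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```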
## Error Handling
- If a profile page fails to load (404, private, etc.): log a warning, skip that profile, continue
- If a post page fails to load: log a warning, skip that post
- If a field cannot be found: store empty string (no crash)
- On keyboard interrupt: flush any collected data to output files before exiting
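The interrupt-flush behaviour fits naturally in a `try/finally` around the main loop. The driver below is a hypothetical skeleton (the `scrape_profile` callable stands in for the real per-profile scraping logic):

```python
import csv

FIELDS = ["profile", "post_url", "date", "caption", "likes",
          "image_urls", "hashtags", "mentions", "location", "media_type"]

def flush_csv(posts: list[dict], path: str) -> None:
    """Write whatever has been collected so far; safe to call on Ctrl-C.
    Missing fields become empty strings (extrasaction ignores extras)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(posts)

def scrape_all(profiles, scrape_profile, out_csv="output.csv"):
    """Hypothetical driver: collect posts, flushing output on interrupt."""
    posts: list[dict] = []
    try:
        for profile in profiles:
            posts.extend(scrape_profile(profile))
    except KeyboardInterrupt:
        print("Interrupted -- flushing collected data")
    finally:
        # Runs on normal exit and on interrupt alike
        flush_csv(posts, out_csv)
    return posts
```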
## Dependencies
- `playwright` (Python) — browser automation
- Python standard library for everything else: `csv` for CSV output, plain file writes for Markdown, and `re` for hashtag/mention extraction
## Files
```
instagram/
├── profiles.txt # input: list of profile URLs
├── scraper.py # main script
├── browser_profile/ # persistent Chromium session (gitignored)
├── output.csv # combined output
└── output.md # combined output
```
## Session Persistence
The `./browser_profile/` directory holds the Chromium user data. Once authenticated, subsequent runs reuse the session without prompting. If Instagram logs the session out, the script detects this and prompts for re-authentication.
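A sketch of the launch-and-authenticate step using Playwright's sync API (`launch_persistent_context` is the real API; the logged-in heuristic of checking for the login form is an assumption, since Instagram's markup changes often):

```python
def open_session(user_data_dir: str = "./browser_profile"):
    """Launch Chromium with a persistent profile and ensure we are logged in.

    Returns (playwright, context, page) so the caller can close them later.
    """
    # Imported lazily so this sketch can be read without playwright installed
    from playwright.sync_api import sync_playwright

    p = sync_playwright().start()
    context = p.chromium.launch_persistent_context(user_data_dir, headless=False)
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://www.instagram.com/")
    # Hypothetical check: the username input only appears when logged out
    if page.locator("input[name='username']").count() > 0:
        input("Not logged in. Log in in the browser window, then press Enter... ")
    return p, context, page
```

On later runs the cookies in `./browser_profile/` keep the session alive, so the `input()` prompt is skipped entirely.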