# Instagram Scraper — Design Spec

**Date:** 2026-03-29

## Overview

A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.

## Architecture

Single script, no external services. A persistent browser profile (`./browser_profile/`) stores cookies and session state, so authentication only needs to happen once.

## Flow

1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect the Instagram home feed); if not, pause and wait for the user to log in manually, then press Enter to continue
3. Read `profiles.txt` and extract profile URLs (skip blank lines and line numbers)
4. For each profile:
   - Navigate to the profile page
   - Wait for the posts grid to load
   - Collect the first 5 post links from the grid
   - Visit each post page
   - Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`

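Step 3 above can be sketched as a pair of small pure helpers. Normalizing bare usernames to full profile URLs is an assumption (the spec only promises URLs in `profiles.txt`); `parse_profiles` and `username_from_url` are illustrative names, not a fixed API:

```python
import re


def parse_profiles(text: str) -> list[str]:
    """Extract profile URLs from profiles.txt content.

    Tolerates blank lines and optional leading line numbers
    (e.g. "1. https://..." or "2) https://...").
    """
    urls = []
    for line in text.splitlines():
        # Drop leading line numbers like "1." or "2)" plus surrounding whitespace
        cleaned = re.sub(r"^\s*\d+[.)]?\s*", "", line).strip()
        if not cleaned:
            continue
        # Assumption: bare usernames are normalized to full profile URLs
        if not cleaned.startswith("http"):
            cleaned = f"https://www.instagram.com/{cleaned.strip('/')}/"
        urls.append(cleaned)
    return urls


def username_from_url(url: str) -> str:
    # The username is the first path segment of the profile URL
    return url.split("instagram.com/")[-1].strip("/").split("/")[0]
```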
## Data Fields

| Field | Description |
|---|---|
| `profile` | Profile username (from `profiles.txt`) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from the `<time>` element's `datetime` attribute |
| `caption` | Full caption text (expanding the "more" button if present) |
| `likes` | Like count (integer, or the raw string if not parseable) |
| `image_urls` | Comma-separated list of image/video `src` URLs in the post |
| `hashtags` | Hashtags extracted from the caption (comma-separated) |
| `mentions` | @mentions extracted from the caption (comma-separated) |
| `location` | Location tag text if present, else empty |
| `media_type` | One of: `photo`, `video`, `carousel` |

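The caption-derived fields can be extracted with the standard `re` module. The patterns below are simplifying assumptions (Instagram also allows some Unicode characters in hashtags); the likes parser falls back to the raw string per the table above:

```python
import re

# Simplified patterns: word characters after '#', word characters and dots after '@'
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@([\w.]+)")


def extract_hashtags(caption: str) -> str:
    """Comma-separated hashtags found in the caption."""
    return ",".join(HASHTAG_RE.findall(caption))


def extract_mentions(caption: str) -> str:
    """Comma-separated @mentions found in the caption."""
    return ",".join(MENTION_RE.findall(caption))


def parse_likes(raw: str):
    """Return the like count as an int, or the raw string if not parseable."""
    digits = raw.replace(",", "").strip()
    try:
        return int(digits)
    except ValueError:
        return raw
```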
## Output Format

**CSV (`output.csv`):** One row per post, with columns matching the data fields above.

**Markdown (`output.md`):** Grouped by profile. Each post is a second-level section with all fields as a bullet list.

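A minimal sketch of both writers, rendering to strings so the script can write them out to `output.csv`/`output.md`; `FIELDS` and the grouping helper are illustrative, not a fixed API:

```python
import csv
import io

FIELDS = ["profile", "post_url", "date", "caption", "likes",
          "image_urls", "hashtags", "mentions", "location", "media_type"]


def write_csv(posts: list[dict]) -> str:
    """Render posts as CSV text: one row per post, columns = data fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
    return buf.getvalue()


def write_markdown(posts: list[dict]) -> str:
    """Render posts grouped by profile; each post is a second-level section."""
    by_profile: dict[str, list[dict]] = {}
    for post in posts:
        by_profile.setdefault(post["profile"], []).append(post)
    lines = []
    for profile, group in by_profile.items():
        lines.append(f"# {profile}")
        for post in group:
            lines.append(f"## {post['post_url']}")
            for field in FIELDS:
                lines.append(f"- **{field}:** {post.get(field, '')}")
            lines.append("")  # blank line between posts
    return "\n".join(lines)
```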
## Error Handling

- If a profile page fails to load (404, private account, etc.): log a warning, skip that profile, and continue
- If a post page fails to load: log a warning and skip that post
- If a field cannot be found: store an empty string (no crash)
- On keyboard interrupt: flush any collected data to the output files before exiting

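The flush-on-interrupt behavior can be sketched with `try`/`finally`: `KeyboardInterrupt` is not an `Exception` subclass, so the per-profile handler lets it propagate while `finally` still flushes. `scrape_profile` and `flush_outputs` are placeholder callables, not part of the spec:

```python
def scrape_all(profiles, scrape_profile, flush_outputs):
    """Scrape each profile, always flushing collected rows on exit.

    scrape_profile(profile) returns a list of post dicts (or raises);
    flush_outputs(rows) writes whatever was collected to output.csv/output.md.
    """
    collected = []
    try:
        for profile in profiles:
            try:
                collected.extend(scrape_profile(profile))
            except Exception as exc:  # 404, private, timeout: warn and continue
                print(f"warning: skipping {profile}: {exc}")
    finally:
        # Runs on normal exit AND on KeyboardInterrupt (Ctrl-C)
        flush_outputs(collected)
    return collected
```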
## Dependencies

- `playwright` (Python) — browser automation
- Standard library only otherwise: `csv` for CSV output, string formatting for Markdown, and `re` for hashtag/mention extraction

## Files

```
instagram/
├── profiles.txt       # input: list of profile URLs
├── scraper.py         # main script
├── browser_profile/   # persistent Chromium session (gitignored)
├── output.csv         # combined output
└── output.md          # combined output
```

## Session Persistence

The `./browser_profile/` directory holds the Chromium user data. Once authenticated, subsequent runs reuse the session without prompting. If Instagram logs the session out, the script detects this and prompts for re-authentication.
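
A sketch of launching the persistent context and detecting a logged-out session, assuming Instagram redirects unauthenticated visitors to `/accounts/login` (a heuristic, not verified against current Instagram behavior):

```python
def is_authenticated(current_url: str) -> bool:
    """Heuristic: logged-out visitors are redirected to the login page."""
    return "/accounts/login" not in current_url


def open_session(profile_dir: str = "./browser_profile"):
    """Launch Chromium with a persistent profile; prompt for manual login if needed."""
    # Import deferred so the pure helper above stays importable without Playwright
    from playwright.sync_api import sync_playwright

    pw = sync_playwright().start()
    context = pw.chromium.launch_persistent_context(profile_dir, headless=False)
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://www.instagram.com/")
    if not is_authenticated(page.url):
        input("Log in to Instagram in the browser window, then press Enter...")
    return pw, context, page
```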