docs: add README, CLAUDE.md, and anonymised profiles.txt for public release
77
CLAUDE.md
Normal file
@@ -0,0 +1,77 @@
# Instagram Scraper — Agent Context

## What this project is

A Playwright-based Instagram scraper. Reads `profiles.txt`, visits each profile in a real Chromium browser, scrapes the last N posts, and writes `output.csv` + `output.md`.

## How to run

```bash
uv run python scraper.py              # run the scraper
uv run pytest tests/ -v               # run unit tests
uv run playwright install chromium    # install browser (first time only)
```

Always use `uv run` — never call `python` or `pip` directly.

## Key files

| File | Purpose |
|---|---|
| `scraper.py` | All logic: auth, profile scraping, post scraping, output |
| `profiles.txt` | Input: one Instagram URL per line |
| `auth_state.json` | Saved Playwright session (gitignored, created on first run) |
| `output.csv` | Scraped results (gitignored) |
| `output.md` | Scraped results in Markdown (gitignored) |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |

## Architecture

Single script, no framework. Key functions in `scraper.py`:

- `extract_hashtags(text)` / `extract_mentions(text)` — pure regex, fully tested
- `profile_slug_from_url(url)` — extracts the username from a profile URL
- `read_profiles()` — reads `profiles.txt`, strips line-number prefixes
- `is_logged_in(page)` / `ensure_authenticated(browser)` — session management
- `get_post_urls(page, profile_url)` — collects post links from the profile grid
- `scrape_post(page, post_url, profile_slug)` — extracts all fields from a post
- `write_csv(posts)` / `write_markdown(posts)` — output writers
- `main()` — orchestrates everything

## Output fields

`profile`, `post_url`, `date` (ISO 8601), `caption`, `likes`, `image_urls` (comma-separated CDN URLs), `hashtags`, `mentions`, `location`, `media_type` (photo/video/carousel)
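
A minimal sketch of `write_csv` against these fields; the `FIELDS` constant and the exact signature are assumptions:

```python
import csv

# Column order matches the field list above; the constant name is assumed.
FIELDS = ["profile", "post_url", "date", "caption", "likes",
          "image_urls", "hashtags", "mentions", "location", "media_type"]

def write_csv(posts: list[dict], path: str = "output.csv") -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for post in posts:
            # Missing fields become empty strings rather than crashing.
            writer.writerow({k: post.get(k, "") for k in FIELDS})
```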

## DOM selectors (verified against live Instagram)

Instagram uses atomic CSS — class names change. These structural selectors are stable:

- **Post grid links**: `a[href*="/p/"]`
- **Date**: `time[datetime]` → `.get_attribute('datetime')` (first element = post date)
- **Caption**: JS tree walker over text nodes in the `section` containing a link to the profile slug; filter `len > 20`, skip relative times
- **Likes**: first `span` with purely numeric text
- **Images**: JS `img[src*="cdninstagram"]` filtered by `width > 100` and excluding `/s150x150/`
- **Carousel**: `button[aria-label="Next"]` exists
- **Video**: `video` element exists
- **Location**: `a[href*="/explore/locations/"]` — filter out the text "Locations" (footer link)
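
The carousel/video checks reduce to a small decision. A sketch with the selectors as constants; `classify_media` is an illustrative helper name, and the carousel-before-video priority is an assumption:

```python
# Structural selectors from the list above.
POST_LINK_SELECTOR = 'a[href*="/p/"]'
DATE_SELECTOR = "time[datetime]"
CAROUSEL_SELECTOR = 'button[aria-label="Next"]'
VIDEO_SELECTOR = "video"

def classify_media(has_video: bool, has_carousel_next: bool) -> str:
    """Map the element checks to a media_type value."""
    if has_carousel_next:
        # Assumed priority: a carousel may contain videos but is
        # still reported as 'carousel'.
        return "carousel"
    if has_video:
        return "video"
    return "photo"
```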

## Auth flow

1. Check for `auth_state.json` → load with `browser.new_context(storage_state=...)`
2. If missing or expired → open a visible browser, wait for the user to log in manually, save with `context.storage_state(path="auth_state.json")`
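
Step 1 amounts to choosing the right kwargs for `browser.new_context()`. A sketch with a hypothetical helper (`context_kwargs` is not in `scraper.py`; the real logic lives in `ensure_authenticated`):

```python
from pathlib import Path

AUTH_STATE = Path("auth_state.json")

def context_kwargs(auth_path: Path = AUTH_STATE) -> dict:
    """Pick kwargs for browser.new_context(): reuse saved state if present."""
    if auth_path.exists():
        # Step 1: saved session found, load it.
        return {"storage_state": str(auth_path)}
    # Step 2 path: fresh context; the user logs in manually and the
    # session is then saved with context.storage_state(path=...).
    return {}
```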

## Modifying behaviour

- Change the number of posts: edit the `POSTS_PER_PROFILE` constant at the top of `scraper.py`
- Change input/output paths: edit the `PROFILES_FILE`, `OUTPUT_CSV`, `OUTPUT_MD` constants
- Re-authenticate: delete `auth_state.json` and re-run

## Testing policy

Pure functions (`extract_hashtags`, `extract_mentions`, `profile_slug_from_url`, `read_profiles`) have unit tests. Browser functions are tested end-to-end only — do not mock the browser.
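
A pytest-style example in the spirit of `tests/test_parsers.py`; the stand-in regex is an assumption, since the real tests import `extract_hashtags` from `scraper.py`:

```python
import re

def extract_hashtags(text: str) -> list[str]:
    # Stand-in; the real tests import this from scraper.py.
    return re.findall(r"#(\w+)", text)

def test_extract_hashtags_empty():
    assert extract_hashtags("no tags here") == []

def test_extract_hashtags_multiple():
    assert extract_hashtags("#one #two") == ["one", "two"]
```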

## Dependencies

- `playwright>=1.40.0` — browser automation
- `pytest>=7.0.0` (dev) — test runner
- Everything else is standard library: `re`, `csv`, `sys`, `pathlib`, `itertools`
110
README.md
Normal file
@@ -0,0 +1,110 @@
# Instagram Profile Scraper

Scrapes the last N posts from a list of Instagram profiles and saves the results as CSV and Markdown. Uses Playwright with a real authenticated browser session — no API keys required.

## What it does

1. Reads a list of Instagram profile URLs from `profiles.txt`
2. Opens a Chromium browser (visible, non-headless)
3. On first run: prompts you to log in manually, then saves the session to `auth_state.json`
4. On subsequent runs: reuses the saved session automatically
5. Visits each profile and collects the last 5 post URLs
6. Visits each post and extracts: date, caption, likes, image URLs, hashtags, mentions, location, media type
7. Writes combined results to `output.csv` and `output.md`
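
The steps above can be outlined as a `main()` skeleton. Helper names match `scraper.py`, but the real `main()` takes no arguments; dependencies are passed in here only to keep the sketch self-contained:

```python
def main(read_profiles, ensure_authenticated, get_post_urls, scrape_post,
         write_csv, write_markdown, posts_per_profile=5):
    profiles = read_profiles()                           # step 1
    page = ensure_authenticated()                        # steps 2-4
    posts = []
    for profile_url in profiles:                         # step 5
        for post_url in get_post_urls(page, profile_url)[:posts_per_profile]:
            posts.append(scrape_post(page, post_url))    # step 6
    write_csv(posts)                                     # step 7
    write_markdown(posts)
```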

## Setup

Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).

```bash
git clone <repo-url>
cd instagram-scraper
uv sync
uv run playwright install chromium
```

## Configuration

**`profiles.txt`** — one Instagram profile URL per line. Lines can optionally be prefixed with a number and a tab (the scraper strips them):

```
https://www.instagram.com/username1/
https://www.instagram.com/username2/
https://www.instagram.com/username3/
```
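
The prefix-stripping behaviour can be sketched like this (`read_profiles` exists in `scraper.py`, but the path parameter and regex here are assumptions):

```python
import re
from pathlib import Path

def read_profiles(path: str = "profiles.txt") -> list[str]:
    """Read profile URLs, dropping blank lines and 'N<TAB>' prefixes."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = re.sub(r"^\d+\t", "", line).strip()
        if line:
            urls.append(line)
    return urls
```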

**Constants in `scraper.py`** (edit directly):

| Constant | Default | Description |
|---|---|---|
| `POSTS_PER_PROFILE` | `5` | How many posts to scrape per profile |
| `PROFILES_FILE` | `profiles.txt` | Input file path |
| `OUTPUT_CSV` | `output.csv` | CSV output path |
| `OUTPUT_MD` | `output.md` | Markdown output path |

## Usage

```bash
uv run python scraper.py
```

On first run, a Chromium window opens. Log in to Instagram, then press Enter in the terminal. The session is saved to `auth_state.json` and reused on future runs.

If Instagram logs you out, delete `auth_state.json` and run again.

## Output

### CSV (`output.csv`)

One row per post with these columns:

| Column | Description |
|---|---|
| `profile` | Instagram username |
| `post_url` | Full URL of the post |
| `date` | ISO 8601 datetime (e.g. `2026-02-23T15:28:13.000Z`) |
| `caption` | Full post caption text |
| `likes` | Like count (as displayed) |
| `image_urls` | Comma-separated CDN image URLs |
| `hashtags` | Comma-separated hashtags from the caption |
| `mentions` | Comma-separated @mentions from the caption |
| `location` | Location tag text (empty if none) |
| `media_type` | `photo`, `video`, or `carousel` |
### Markdown (`output.md`)
|
||||||
|
|
||||||
|
Same data, grouped by profile. Each post is a section with all fields as a bullet list.
|
||||||
|
|
||||||

## Error handling

- Private or unavailable profiles: skipped with a warning; scraping continues
- Individual post failures: skipped with a warning; scraping continues
- Missing fields: stored as empty strings, no crash
- Keyboard interrupt (Ctrl+C): saves whatever has been collected so far
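
The skip-and-continue policy can be sketched as below; `scrape_all` is an illustrative name and the signatures are simplified (the real `scrape_post` also takes a page and profile slug):

```python
def scrape_all(post_urls, scrape_post, write_csv):
    """Collect posts, skipping failures; always write what was gathered."""
    posts = []
    try:
        for url in post_urls:
            try:
                posts.append(scrape_post(url))
            except Exception as exc:
                # One bad post does not stop the run.
                print(f"warning: skipping {url}: {exc}")
    except KeyboardInterrupt:
        # Ctrl+C: fall through and save the partial results.
        print("interrupted, saving partial results")
    write_csv(posts)
    return posts
```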

## Files

```
instagram-scraper/
├── scraper.py            # main script
├── profiles.txt          # input: list of profile URLs
├── pyproject.toml        # project metadata and dependencies
├── tests/
│   └── test_parsers.py   # unit tests for parsing functions
├── auth_state.json       # saved session (created on first run, gitignored)
├── output.csv            # results (gitignored)
└── output.md             # results (gitignored)
```

## Running tests

```bash
uv run pytest tests/ -v
```

## Notes

- Works with public profiles. Private profiles are skipped.
- Instagram rate-limits aggressive scraping. The script adds a 1.5 s wait between post requests.
- Session cookies expire periodically. Delete `auth_state.json` to re-authenticate.
- Image URLs are CDN URLs that expire after some time — download them promptly if needed.
profiles.txt
@@ -1,5 +1,3 @@
-https://www.instagram.com/licmuunisul/
+https://www.instagram.com/username1/
-https://www.instagram.com/ligamiunisulpb/
+https://www.instagram.com/username2/
-https://www.instagram.com/lipracunisul/
+https://www.instagram.com/username3/
-https://www.instagram.com/liaphunisul/
-https://www.instagram.com/lipali.unisul/