commit c5a01190c13f4ade3b9caa52a7875df7aadc5d17
Author: belisards
Date:   Sun Mar 29 16:50:18 2026 -0300

    chore: init project with playwright dependency

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..1cfc68c
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,6 @@
+browser_profile/
+output.csv
+output.md
+__pycache__/
+.venv/
+*.pyc

diff --git a/docs/superpowers/plans/2026-03-29-instagram-scraper.md b/docs/superpowers/plans/2026-03-29-instagram-scraper.md
new file mode 100644
index 0000000..5ce6ea0
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-29-instagram-scraper.md
@@ -0,0 +1,525 @@
# Instagram Scraper Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Scrape the last 5 posts from each Instagram profile in `profiles.txt` and save combined output as `output.csv` and `output.md`.

**Architecture:** Single `scraper.py` script using Playwright sync API with a persistent Chromium profile (`./browser_profile/`). Pure parsing functions are unit-tested; browser interaction is manually tested end-to-end. Auth state persists between runs.

**Tech Stack:** Python 3.11+, Playwright (sync), uv for package management, stdlib csv/re for output.
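
The plan assumes a `profiles.txt` already exists at the repo root (Task 1 commits it) but never shows its contents. Based on how `read_profiles` parses it (Task 2), each line is either a bare URL or a number, a tab, and a URL. A hypothetical example — the first username is taken from the unit tests, the second is made up:

```
1	https://www.instagram.com/licmuunisul/
2	https://www.instagram.com/another_profile/
```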

---

## File Structure

| File | Responsibility |
|---|---|
| `pyproject.toml` | Project metadata and dependencies |
| `.gitignore` | Exclude browser_profile/, output files |
| `scraper.py` | All logic: auth check, profile reading, scraping, output writing |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |

---

### Task 1: Project Setup

**Files:**
- Create: `pyproject.toml`
- Create: `.gitignore`

- [ ] **Step 1: Create pyproject.toml**

Note: pytest is added as a dev dependency so `uv run pytest` works in Task 2, and no `[build-system]` table is declared — the repo is a single script, not an installable package, so there is nothing for a build backend to ship.

```toml
[project]
name = "instagram-scraper"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "playwright>=1.40.0",
]

[dependency-groups]
dev = [
    "pytest>=8.0",
]
```

- [ ] **Step 2: Create .gitignore**

```
browser_profile/
output.csv
output.md
__pycache__/
.venv/
*.pyc
```

- [ ] **Step 3: Install dependencies and Playwright browsers**

```bash
uv sync
uv run playwright install chromium
```

Expected: dependencies (including pytest) installed and the Chromium browser downloaded successfully.

- [ ] **Step 4: Commit**

```bash
git init
git add pyproject.toml .gitignore profiles.txt
git commit -m "chore: init project with playwright dependency"
```

---

### Task 2: Pure Parser Functions + Tests

**Files:**
- Create: `scraper.py` (parser functions only)
- Create: `tests/test_parsers.py`

- [ ] **Step 1: Write failing tests**

Create `tests/__init__.py` (empty), then `tests/test_parsers.py`:

```python
import sys, os
sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
from scraper import extract_hashtags, extract_mentions, profile_slug_from_url

def test_extract_hashtags_basic():
    assert extract_hashtags("Hello #world #foo") == ["#world", "#foo"]

def test_extract_hashtags_empty():
    assert extract_hashtags("No tags here") == []

def test_extract_hashtags_deduplicates():
    assert extract_hashtags("#foo #foo #bar") == ["#foo", "#bar"]

def test_extract_mentions_basic():
    assert extract_mentions("Hey @alice and @bob") == ["@alice", "@bob"]

def test_extract_mentions_empty():
    assert extract_mentions("No mentions") == []

def test_profile_slug_from_url():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul/") == "licmuunisul"

def test_profile_slug_no_trailing_slash():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul") == "licmuunisul"
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
uv run pytest tests/test_parsers.py -v
```

Expected: `ImportError` or `ModuleNotFoundError` (scraper.py doesn't exist yet).
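
Before implementing, the two regex patterns the tests assume (`#\w+` and `@\w+`) can be sanity-checked in isolation — a quick sketch, not one of the plan's steps:

```python
import re

# The same patterns extract_hashtags / extract_mentions are expected to use.
# Note that punctuation (the comma after #foo) correctly terminates a match,
# since \w matches only word characters.
text = "Hello #world #foo, hey @alice and @bob"
print(re.findall(r"#\w+", text))  # → ['#world', '#foo']
print(re.findall(r"@\w+", text))  # → ['@alice', '@bob']
```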
+ +- [ ] **Step 3: Implement parser functions in scraper.py** + +Create `scraper.py`: + +```python +import re +import csv +import sys +from pathlib import Path +from playwright.sync_api import sync_playwright, Page + +PROFILES_FILE = Path("profiles.txt") +OUTPUT_CSV = Path("output.csv") +OUTPUT_MD = Path("output.md") +BROWSER_PROFILE = Path("browser_profile") +POSTS_PER_PROFILE = 5 + + +def extract_hashtags(text: str) -> list[str]: + seen = set() + result = [] + for tag in re.findall(r"#\w+", text): + if tag not in seen: + seen.add(tag) + result.append(tag) + return result + + +def extract_mentions(text: str) -> list[str]: + seen = set() + result = [] + for mention in re.findall(r"@\w+", text): + if mention not in seen: + seen.add(mention) + result.append(mention) + return result + + +def profile_slug_from_url(url: str) -> str: + return url.rstrip("/").split("/")[-1] + + +def read_profiles() -> list[str]: + urls = [] + for line in PROFILES_FILE.read_text().splitlines(): + line = line.strip() + if not line: + continue + # Lines may be prefixed with a number and tab + parts = line.split("\t") + url = parts[-1].strip() + if url.startswith("http"): + urls.append(url) + return urls +``` + +- [ ] **Step 4: Run tests to verify they pass** + +```bash +uv run pytest tests/test_parsers.py -v +``` + +Expected: all 7 tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add scraper.py tests/ +git commit -m "feat: add parser functions with tests" +``` + +--- + +### Task 3: Auth Check + +**Files:** +- Modify: `scraper.py` (add `is_logged_in`, `ensure_authenticated`) + +- [ ] **Step 1: Add auth functions to scraper.py** + +Append after the `read_profiles` function: + +```python +def is_logged_in(page: Page) -> bool: + page.goto("https://www.instagram.com/", wait_until="networkidle", timeout=30000) + # Logged in: shows feed or home icon. Not logged in: shows login form. 
+ return page.locator("input[name='username']").count() == 0 + + +def ensure_authenticated(page: Page) -> None: + if is_logged_in(page): + print("[auth] Session active, proceeding.") + return + print("[auth] Not logged in. Please log in to Instagram in the browser window.") + print("[auth] Press Enter here when you are logged in and can see your feed...") + input() + # Verify login succeeded + if not is_logged_in(page): + print("[auth] Still not logged in. Please try again and restart the script.") + sys.exit(1) + print("[auth] Login confirmed.") +``` + +- [ ] **Step 2: Commit** + +```bash +git add scraper.py +git commit -m "feat: add auth check and manual login wait" +``` + +--- + +### Task 4: Post URL Collection from Profile Grid + +**Files:** +- Modify: `scraper.py` (add `get_post_urls`) + +- [ ] **Step 1: Add post URL collector** + +Append after `ensure_authenticated`: + +```python +def get_post_urls(page: Page, profile_url: str, count: int = POSTS_PER_PROFILE) -> list[str]: + slug = profile_slug_from_url(profile_url) + print(f"[{slug}] Navigating to profile...") + try: + page.goto(profile_url, wait_until="networkidle", timeout=30000) + except Exception as e: + print(f"[{slug}] Failed to load profile: {e}") + return [] + + # Wait for posts grid + try: + page.wait_for_selector("article a[href*='/p/']", timeout=15000) + except Exception: + print(f"[{slug}] No posts found or profile is private.") + return [] + + links = page.locator("article a[href*='/p/']").all() + seen = set() + urls = [] + for link in links: + href = link.get_attribute("href") + if href and href not in seen: + seen.add(href) + urls.append("https://www.instagram.com" + href) + if len(urls) >= count: + break + + print(f"[{slug}] Found {len(urls)} post URLs.") + return urls +``` + +- [ ] **Step 2: Commit** + +```bash +git add scraper.py +git commit -m "feat: collect post URLs from profile grid" +``` + +--- + +### Task 5: Individual Post Scraper + +**Files:** +- Modify: `scraper.py` (add 
`scrape_post`) + +- [ ] **Step 1: Add post scraper** + +Append after `get_post_urls`: + +```python +def scrape_post(page: Page, post_url: str, profile_slug: str) -> dict: + print(f" Scraping {post_url}") + result = { + "profile": profile_slug, + "post_url": post_url, + "date": "", + "caption": "", + "likes": "", + "image_urls": "", + "hashtags": "", + "mentions": "", + "location": "", + "media_type": "", + } + + try: + page.goto(post_url, wait_until="networkidle", timeout=30000) + except Exception as e: + print(f" Failed to load post: {e}") + return result + + # Date + time_el = page.locator("time[datetime]").first + if time_el.count(): + result["date"] = time_el.get_attribute("datetime") or "" + + # Caption — expand "more" if present + more_btn = page.locator("span[role='button']").filter(has_text=re.compile(r"more", re.I)) + if more_btn.count(): + try: + more_btn.first.click() + page.wait_for_timeout(500) + except Exception: + pass + caption_el = page.locator("article h1, article div[data-testid='post-comment-root'] span").first + if caption_el.count(): + result["caption"] = caption_el.inner_text().strip() + + # Likes + likes_el = page.locator("section span:has-text('like'), section a:has-text('like')").first + if likes_el.count(): + result["likes"] = likes_el.inner_text().strip() + else: + # Fallback: aria-label on like button section + like_section = page.locator("section._ae2s, section[class*='like']").first + if like_section.count(): + result["likes"] = like_section.inner_text().strip() + + # Media type + image URLs + carousel = page.locator("div[data-testid='media-number-indicator'], button[aria-label*='Next']") + video = page.locator("video") + if carousel.count(): + result["media_type"] = "carousel" + elif video.count(): + result["media_type"] = "video" + else: + result["media_type"] = "photo" + + imgs = page.locator("article img[src]").all() + img_urls = [img.get_attribute("src") for img in imgs if img.get_attribute("src")] + result["image_urls"] = ", 
".join(img_urls) + + # Hashtags and mentions from caption + result["hashtags"] = ", ".join(extract_hashtags(result["caption"])) + result["mentions"] = ", ".join(extract_mentions(result["caption"])) + + # Location + loc_el = page.locator("a[href*='/explore/locations/']").first + if loc_el.count(): + result["location"] = loc_el.inner_text().strip() + + return result +``` + +- [ ] **Step 2: Commit** + +```bash +git add scraper.py +git commit -m "feat: implement individual post scraper" +``` + +--- + +### Task 6: Output Writers + +**Files:** +- Modify: `scraper.py` (add `write_csv`, `write_markdown`) + +- [ ] **Step 1: Add output writers** + +Append after `scrape_post`: + +```python +FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls", + "hashtags", "mentions", "location", "media_type"] + + +def write_csv(posts: list[dict]) -> None: + with OUTPUT_CSV.open("w", newline="", encoding="utf-8") as f: + writer = csv.DictWriter(f, fieldnames=FIELDS) + writer.writeheader() + writer.writerows(posts) + print(f"[output] CSV saved to {OUTPUT_CSV} ({len(posts)} posts)") + + +def write_markdown(posts: list[dict]) -> None: + from itertools import groupby + with OUTPUT_MD.open("w", encoding="utf-8") as f: + f.write("# Instagram Scrape Results\n\n") + for profile, group in groupby(posts, key=lambda p: p["profile"]): + f.write(f"## {profile}\n\n") + for post in group: + f.write(f"### [{post['post_url']}]({post['post_url']})\n\n") + for field in FIELDS: + if field in ("profile", "post_url"): + continue + value = post.get(field, "") + if value: + f.write(f"- **{field}:** {value}\n") + f.write("\n") + print(f"[output] Markdown saved to {OUTPUT_MD}") +``` + +- [ ] **Step 2: Commit** + +```bash +git add scraper.py +git commit -m "feat: add CSV and markdown output writers" +``` + +--- + +### Task 7: Main Entry Point + +**Files:** +- Modify: `scraper.py` (add `main` function and `__main__` block) + +- [ ] **Step 1: Add main function** + +Append at the end of `scraper.py`: 

```python
def main():
    profiles = read_profiles()
    if not profiles:
        print("No profiles found in profiles.txt")
        return

    print(f"[main] Loaded {len(profiles)} profiles.")
    all_posts = []

    BROWSER_PROFILE.mkdir(exist_ok=True)

    with sync_playwright() as p:
        # launch_persistent_context returns a BrowserContext backed by
        # ./browser_profile/, so cookies and the login session survive restarts.
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(BROWSER_PROFILE),
            headless=False,
            viewport={"width": 1280, "height": 900},
        )
        page = context.new_page()

        try:
            ensure_authenticated(page)
            for profile_url in profiles:
                slug = profile_slug_from_url(profile_url)
                post_urls = get_post_urls(page, profile_url)
                for post_url in post_urls:
                    post = scrape_post(page, post_url, slug)
                    all_posts.append(post)
        except KeyboardInterrupt:
            print("\n[main] Interrupted. Saving collected data...")
        finally:
            context.close()

    if all_posts:
        write_csv(all_posts)
        write_markdown(all_posts)
    else:
        print("[main] No posts collected.")


if __name__ == "__main__":
    main()
```

- [ ] **Step 2: Run the parser tests one final time to confirm nothing broke**

```bash
uv run pytest tests/test_parsers.py -v
```

Expected: all 7 tests PASS.

- [ ] **Step 3: Commit**

```bash
git add scraper.py
git commit -m "feat: wire main entry point, complete scraper"
```

---

### Task 8: End-to-End Test

- [ ] **Step 1: Run the scraper**

```bash
uv run python scraper.py
```

Expected:
- Chromium opens
- If not logged in: prompted to log in, then press Enter
- Script visits each profile listed in `profiles.txt` and collects up to 5 post URLs per profile
- Visits each collected post and scrapes data
- Writes `output.csv` and `output.md`

- [ ] **Step 2: Verify CSV output**

```bash
head -3 output.csv
```

Expected: header row + at least 2 data rows with profile, post_url, date populated.

- [ ] **Step 3: Verify Markdown output**

```bash
head -30 output.md
```

Expected: `# Instagram Scrape Results` heading, profile sections, post subsections with bullet fields.
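
As an optional extra check after Task 8 (a standalone sketch, not one of the plan's steps — `FIELDS` is duplicated here so it runs without importing `scraper.py`), the stdlib `csv` round-trip that `write_csv` relies on can be verified in memory with a hypothetical row:

```python
import csv
import io

# Same column order as FIELDS in scraper.py (duplicated to stay standalone)
FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls",
          "hashtags", "mentions", "location", "media_type"]

# Write one made-up row the way write_csv does, then read it back
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({f: "" for f in FIELDS} | {"profile": "example_profile"})

buf.seek(0)
reader = csv.DictReader(buf)
rows = list(reader)
print(reader.fieldnames == FIELDS, rows[0]["profile"])  # → True example_profile
```

The same `head -3 output.csv` check from Step 2 remains the authoritative verification; this only confirms the header/field contract.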

diff --git a/docs/superpowers/specs/2026-03-29-instagram-scraper-design.md b/docs/superpowers/specs/2026-03-29-instagram-scraper-design.md
new file mode 100644
index 0000000..3cf690c
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-29-instagram-scraper-design.md
@@ -0,0 +1,72 @@
# Instagram Scraper — Design Spec

**Date:** 2026-03-29

## Overview

A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.

## Architecture

Single script, no external services. Persistent browser profile (`./browser_profile/`) stores cookies/session so authentication only needs to happen once.

## Flow

1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect Instagram home feed); if not, pause and wait for user to log in manually, then press Enter to continue
3. Read `profiles.txt`, extract profile URLs (skip blank lines and line numbers)
4. For each profile:
   - Navigate to the profile page
   - Wait for the posts grid to load
   - Collect the first 5 post links from the grid
   - Visit each post page
   - Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`

## Data Fields

| Field | Description |
|---|---|
| `profile` | Profile username (from profiles.txt) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from `