
chore: init project with playwright dependency

This commit is contained in:
belisards
2026-03-29 16:50:18 -03:00
commit c5a01190c1
5 changed files with 616 additions and 0 deletions

.gitignore (new file)

@@ -0,0 +1,6 @@
browser_profile/
output.csv
output.md
__pycache__/
.venv/
*.pyc


@@ -0,0 +1,525 @@
# Instagram Scraper Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Scrape the last 5 posts from each Instagram profile in `profiles.txt` and save combined output as `output.csv` and `output.md`.
**Architecture:** Single `scraper.py` script using Playwright sync API with a persistent Chromium profile (`./browser_profile/`). Pure parsing functions are unit-tested; browser interaction is manually tested end-to-end. Auth state persists between runs.
**Tech Stack:** Python 3.11+, Playwright (sync), uv for package management, stdlib csv/re for output.
---
## File Structure
| File | Responsibility |
|---|---|
| `pyproject.toml` | Project metadata and dependencies |
| `.gitignore` | Exclude browser_profile/, output files |
| `scraper.py` | All logic: auth check, profile reading, scraping, output writing |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |
---
### Task 1: Project Setup
**Files:**
- Create: `pyproject.toml`
- Create: `.gitignore`
- [ ] **Step 1: Create pyproject.toml**
```toml
[project]
name = "instagram-scraper"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "playwright>=1.40.0",
]
```
- [ ] **Step 2: Create .gitignore**
```
browser_profile/
output.csv
output.md
__pycache__/
.venv/
*.pyc
```
- [ ] **Step 3: Install dependencies and Playwright browsers**
```bash
uv sync
uv run playwright install chromium
```
Expected: Chromium browser downloaded successfully.
- [ ] **Step 4: Commit**
```bash
git init
git add pyproject.toml .gitignore profiles.txt
git commit -m "chore: init project with playwright dependency"
```
---
### Task 2: Pure Parser Functions + Tests
**Files:**
- Create: `scraper.py` (parser functions only)
- Create: `tests/test_parsers.py`
- [ ] **Step 1: Write failing tests**
Create `tests/__init__.py` (empty), then `tests/test_parsers.py`:
```python
import sys, os

sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))

from scraper import extract_hashtags, extract_mentions, profile_slug_from_url


def test_extract_hashtags_basic():
    assert extract_hashtags("Hello #world #foo") == ["#world", "#foo"]

def test_extract_hashtags_empty():
    assert extract_hashtags("No tags here") == []

def test_extract_hashtags_deduplicates():
    assert extract_hashtags("#foo #foo #bar") == ["#foo", "#bar"]

def test_extract_mentions_basic():
    assert extract_mentions("Hey @alice and @bob") == ["@alice", "@bob"]

def test_extract_mentions_empty():
    assert extract_mentions("No mentions") == []

def test_profile_slug_from_url():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul/") == "licmuunisul"

def test_profile_slug_trailing_slash():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul") == "licmuunisul"
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
uv run pytest tests/test_parsers.py -v
```
Expected: `ImportError` or `ModuleNotFoundError` (scraper.py doesn't exist yet).
- [ ] **Step 3: Implement parser functions in scraper.py**
Create `scraper.py`:
```python
import re
import csv
import sys
from pathlib import Path

from playwright.sync_api import sync_playwright, Page

PROFILES_FILE = Path("profiles.txt")
OUTPUT_CSV = Path("output.csv")
OUTPUT_MD = Path("output.md")
BROWSER_PROFILE = Path("browser_profile")
POSTS_PER_PROFILE = 5


def extract_hashtags(text: str) -> list[str]:
    seen = set()
    result = []
    for tag in re.findall(r"#\w+", text):
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result


def extract_mentions(text: str) -> list[str]:
    seen = set()
    result = []
    for mention in re.findall(r"@\w+", text):
        if mention not in seen:
            seen.add(mention)
            result.append(mention)
    return result


def profile_slug_from_url(url: str) -> str:
    return url.rstrip("/").split("/")[-1]


def read_profiles() -> list[str]:
    urls = []
    for line in PROFILES_FILE.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        # Lines may be prefixed with a number and tab
        parts = line.split("\t")
        url = parts[-1].strip()
        if url.startswith("http"):
            urls.append(url)
    return urls
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
uv run pytest tests/test_parsers.py -v
```
Expected: all 7 tests PASS.
- [ ] **Step 5: Commit**
```bash
git add scraper.py tests/
git commit -m "feat: add parser functions with tests"
```
---
### Task 3: Auth Check
**Files:**
- Modify: `scraper.py` (add `is_logged_in`, `ensure_authenticated`)
- [ ] **Step 1: Add auth functions to scraper.py**
Append after the `read_profiles` function:
```python
def is_logged_in(page: Page) -> bool:
    page.goto("https://www.instagram.com/", wait_until="networkidle", timeout=30000)
    # Logged in: shows feed or home icon. Not logged in: shows login form.
    return page.locator("input[name='username']").count() == 0


def ensure_authenticated(page: Page) -> None:
    if is_logged_in(page):
        print("[auth] Session active, proceeding.")
        return
    print("[auth] Not logged in. Please log in to Instagram in the browser window.")
    print("[auth] Press Enter here when you are logged in and can see your feed...")
    input()
    # Verify login succeeded
    if not is_logged_in(page):
        print("[auth] Still not logged in. Please try again and restart the script.")
        sys.exit(1)
    print("[auth] Login confirmed.")
```
- [ ] **Step 2: Commit**
```bash
git add scraper.py
git commit -m "feat: add auth check and manual login wait"
```
---
### Task 4: Post URL Collection from Profile Grid
**Files:**
- Modify: `scraper.py` (add `get_post_urls`)
- [ ] **Step 1: Add post URL collector**
Append after `ensure_authenticated`:
```python
def get_post_urls(page: Page, profile_url: str, count: int = POSTS_PER_PROFILE) -> list[str]:
    slug = profile_slug_from_url(profile_url)
    print(f"[{slug}] Navigating to profile...")
    try:
        page.goto(profile_url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        print(f"[{slug}] Failed to load profile: {e}")
        return []
    # Wait for posts grid
    try:
        page.wait_for_selector("article a[href*='/p/']", timeout=15000)
    except Exception:
        print(f"[{slug}] No posts found or profile is private.")
        return []
    links = page.locator("article a[href*='/p/']").all()
    seen = set()
    urls = []
    for link in links:
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            urls.append("https://www.instagram.com" + href)
        if len(urls) >= count:
            break
    print(f"[{slug}] Found {len(urls)} post URLs.")
    return urls
```
- [ ] **Step 2: Commit**
```bash
git add scraper.py
git commit -m "feat: collect post URLs from profile grid"
```
---
### Task 5: Individual Post Scraper
**Files:**
- Modify: `scraper.py` (add `scrape_post`)
- [ ] **Step 1: Add post scraper**
Append after `get_post_urls`:
```python
def scrape_post(page: Page, post_url: str, profile_slug: str) -> dict:
    print(f"  Scraping {post_url}")
    result = {
        "profile": profile_slug,
        "post_url": post_url,
        "date": "",
        "caption": "",
        "likes": "",
        "image_urls": "",
        "hashtags": "",
        "mentions": "",
        "location": "",
        "media_type": "",
    }
    try:
        page.goto(post_url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        print(f"  Failed to load post: {e}")
        return result
    # Date
    time_el = page.locator("time[datetime]").first
    if time_el.count():
        result["date"] = time_el.get_attribute("datetime") or ""
    # Caption: expand "more" if present
    more_btn = page.locator("span[role='button']").filter(has_text=re.compile(r"more", re.I))
    if more_btn.count():
        try:
            more_btn.first.click()
            page.wait_for_timeout(500)
        except Exception:
            pass
    caption_el = page.locator("article h1, article div[data-testid='post-comment-root'] span").first
    if caption_el.count():
        result["caption"] = caption_el.inner_text().strip()
    # Likes
    likes_el = page.locator("section span:has-text('like'), section a:has-text('like')").first
    if likes_el.count():
        result["likes"] = likes_el.inner_text().strip()
    else:
        # Fallback: aria-label on like button section
        like_section = page.locator("section._ae2s, section[class*='like']").first
        if like_section.count():
            result["likes"] = like_section.inner_text().strip()
    # Media type + image URLs
    carousel = page.locator("div[data-testid='media-number-indicator'], button[aria-label*='Next']")
    video = page.locator("video")
    if carousel.count():
        result["media_type"] = "carousel"
    elif video.count():
        result["media_type"] = "video"
    else:
        result["media_type"] = "photo"
    imgs = page.locator("article img[src]").all()
    img_urls = [img.get_attribute("src") for img in imgs if img.get_attribute("src")]
    result["image_urls"] = ", ".join(img_urls)
    # Hashtags and mentions from caption
    result["hashtags"] = ", ".join(extract_hashtags(result["caption"]))
    result["mentions"] = ", ".join(extract_mentions(result["caption"]))
    # Location
    loc_el = page.locator("a[href*='/explore/locations/']").first
    if loc_el.count():
        result["location"] = loc_el.inner_text().strip()
    return result
```
- [ ] **Step 2: Commit**
```bash
git add scraper.py
git commit -m "feat: implement individual post scraper"
```
---
### Task 6: Output Writers
**Files:**
- Modify: `scraper.py` (add `write_csv`, `write_markdown`)
- [ ] **Step 1: Add output writers**
Append after `scrape_post`:
```python
FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls",
          "hashtags", "mentions", "location", "media_type"]


def write_csv(posts: list[dict]) -> None:
    with OUTPUT_CSV.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(posts)
    print(f"[output] CSV saved to {OUTPUT_CSV} ({len(posts)} posts)")


def write_markdown(posts: list[dict]) -> None:
    from itertools import groupby

    with OUTPUT_MD.open("w", encoding="utf-8") as f:
        f.write("# Instagram Scrape Results\n\n")
        for profile, group in groupby(posts, key=lambda p: p["profile"]):
            f.write(f"## {profile}\n\n")
            for post in group:
                f.write(f"### [{post['post_url']}]({post['post_url']})\n\n")
                for field in FIELDS:
                    if field in ("profile", "post_url"):
                        continue
                    value = post.get(field, "")
                    if value:
                        f.write(f"- **{field}:** {value}\n")
                f.write("\n")
    print(f"[output] Markdown saved to {OUTPUT_MD}")
```
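One subtlety worth noting: `itertools.groupby` only merges *consecutive* items with the same key, so the markdown grouping relies on the posts list arriving already ordered by profile (which holds here, since posts are collected profile by profile). A minimal sketch of the behavior:

```python
from itertools import groupby

# groupby merges only adjacent items that share a key, so input must
# be pre-grouped (or sorted) by that key for correct grouping.
posts = [
    {"profile": "a", "post_url": "p1"},
    {"profile": "a", "post_url": "p2"},
    {"profile": "b", "post_url": "p3"},
]
grouped = {k: [p["post_url"] for p in g]
           for k, g in groupby(posts, key=lambda p: p["profile"])}
print(grouped)  # {'a': ['p1', 'p2'], 'b': ['p3']}

# If the same profile reappears non-consecutively, it is split into two groups:
shuffled = posts + [{"profile": "a", "post_url": "p4"}]
keys = [k for k, _ in groupby(shuffled, key=lambda p: p["profile"])]
print(keys)  # ['a', 'b', 'a']
```

If post collection were ever reordered (e.g. parallelized), a `sorted(posts, key=...)` before `groupby` would be needed.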
- [ ] **Step 2: Commit**
```bash
git add scraper.py
git commit -m "feat: add CSV and markdown output writers"
```
---
### Task 7: Main Entry Point
**Files:**
- Modify: `scraper.py` (add `main` function and `__main__` block)
- [ ] **Step 1: Add main function**
Append at the end of `scraper.py`:
```python
def main():
    profiles = read_profiles()
    if not profiles:
        print("No profiles found in profiles.txt")
        return
    print(f"[main] Loaded {len(profiles)} profiles.")
    all_posts = []
    BROWSER_PROFILE.mkdir(exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch_persistent_context(
            user_data_dir=str(BROWSER_PROFILE),
            headless=False,
            viewport={"width": 1280, "height": 900},
        )
        # Reuse the tab the persistent context opens instead of spawning a second one
        page = browser.pages[0] if browser.pages else browser.new_page()
        ensure_authenticated(page)
        try:
            for profile_url in profiles:
                slug = profile_slug_from_url(profile_url)
                post_urls = get_post_urls(page, profile_url)
                for post_url in post_urls:
                    post = scrape_post(page, post_url, slug)
                    all_posts.append(post)
        except KeyboardInterrupt:
            print("\n[main] Interrupted. Saving collected data...")
        browser.close()
    if all_posts:
        write_csv(all_posts)
        write_markdown(all_posts)
    else:
        print("[main] No posts collected.")


if __name__ == "__main__":
    main()
```
- [ ] **Step 2: Run the parser tests one final time to confirm nothing broke**
```bash
uv run pytest tests/test_parsers.py -v
```
Expected: all 7 tests PASS.
- [ ] **Step 3: Commit**
```bash
git add scraper.py
git commit -m "feat: wire main entry point, complete scraper"
```
---
### Task 8: End-to-End Test
- [ ] **Step 1: Run the scraper**
```bash
uv run python scraper.py
```
Expected:
- Chromium opens
- If not logged in: prompted to log in, then press Enter
- Script visits each of the 5 profiles, collects 5 post URLs each
- Visits each post (up to 25 total) and scrapes data
- Writes `output.csv` and `output.md`
- [ ] **Step 2: Verify CSV output**
```bash
head -3 output.csv
```
Expected: header row + at least 2 data rows with profile, post_url, date populated.
- [ ] **Step 3: Verify Markdown output**
```bash
head -30 output.md
```
Expected: `# Instagram Scrape Results` heading, profile sections, post subsections with bullet fields.
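The CSV checks above can also be automated. A quick sanity-check sketch, assuming the header matches `FIELDS` from `scraper.py` (the `validate_csv` helper and the sample file name are illustrative, not part of the plan's code):

```python
import csv
from pathlib import Path

# Mirrors FIELDS in scraper.py
EXPECTED_FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls",
                   "hashtags", "mentions", "location", "media_type"]

def validate_csv(path: Path) -> int:
    """Check the header matches the expected fields and count data rows."""
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        assert reader.fieldnames == EXPECTED_FIELDS, f"unexpected header: {reader.fieldnames}"
        rows = list(reader)
    for row in rows:
        assert row["profile"] and row["post_url"], "profile/post_url must be populated"
    return len(rows)

# Demonstrated on a throwaway sample; point it at output.csv after a real run.
sample = Path("sample_output.csv")
with sample.open("w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=EXPECTED_FIELDS)
    w.writeheader()
    w.writerow({k: "" for k in EXPECTED_FIELDS}
               | {"profile": "licmuunisul", "post_url": "https://www.instagram.com/p/XXXX/"})
print(validate_csv(sample))  # 1
```

After the end-to-end run, `validate_csv(Path("output.csv"))` should return up to 25 (5 profiles × 5 posts), minus any skipped posts.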


@@ -0,0 +1,72 @@
# Instagram Scraper — Design Spec
**Date:** 2026-03-29
## Overview
A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.
## Architecture
Single script, no external services. Persistent browser profile (`./browser_profile/`) stores cookies/session so authentication only needs to happen once.
## Flow
1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect Instagram home feed); if not, pause and wait for user to log in manually, then press Enter to continue
3. Read `profiles.txt`, extract profile URLs (skip blank lines and line numbers)
4. For each profile:
   - Navigate to the profile page
   - Wait for the posts grid to load
   - Collect the first 5 post links from the grid
   - Visit each post page
   - Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`
## Data Fields
| Field | Description |
|---|---|
| `profile` | Profile username (from profiles.txt) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from `<time>` element's `datetime` attribute |
| `caption` | Full caption text (expand "more" button if present) |
| `likes` | Like count (integer or raw string if not parseable) |
| `image_urls` | Comma-separated list of image/video src URLs in the post |
| `hashtags` | Hashtags extracted from caption (comma-separated) |
| `mentions` | @mentions extracted from caption (comma-separated) |
| `location` | Location tag text if present, else empty |
| `media_type` | One of: `photo`, `video`, `carousel` |
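List-valued fields (`hashtags`, `mentions`, `image_urls`) are flattened to comma-separated strings so every field fits one CSV cell. A small sketch of how a caption yields those fields, using the same ordered-dedup regex approach as the plan's `extract_hashtags`/`extract_mentions` (the caption text and the `extract_unique` helper are hypothetical, for illustration only):

```python
import re

def extract_unique(pattern: str, text: str) -> list[str]:
    # Ordered dedup, matching the approach of extract_hashtags/extract_mentions
    seen, out = set(), []
    for m in re.findall(pattern, text):
        if m not in seen:
            seen.add(m)
            out.append(m)
    return out

# Hypothetical caption
caption = "Recebendo a @liaphunisul no encontro! #saude #unisul #saude"
hashtags = ", ".join(extract_unique(r"#\w+", caption))
mentions = ", ".join(extract_unique(r"@\w+", caption))
print(hashtags)  # #saude, #unisul
print(mentions)  # @liaphunisul
```

Note the duplicate `#saude` collapses to one entry while original order is preserved.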
## Output Format
**CSV (`output.csv`):** One row per post, columns matching data fields above.
**Markdown (`output.md`):** Grouped by profile. Each post is a second-level section with all fields as a bullet list.
## Error Handling
- If a profile page fails to load (404, private, etc.): log a warning, skip that profile, continue
- If a post page fails to load: log a warning, skip that post
- If a field cannot be found: store empty string (no crash)
- On keyboard interrupt: flush any collected data to output files before exiting
## Dependencies
- `playwright` (Python) — browser automation
- Standard library for everything else: `csv` for CSV output, `re` for hashtag/mention extraction, plain file writes for Markdown
## Files
```
instagram/
├── profiles.txt # input: list of profile URLs
├── scraper.py # main script
├── browser_profile/ # persistent Chromium session (gitignored)
├── output.csv # combined output
└── output.md # combined output
```
## Session Persistence
The `./browser_profile/` directory holds the Chromium user data. Once authenticated, subsequent runs reuse the session without prompting. If Instagram logs the session out, the script detects this and prompts for re-authentication.

profiles.txt (new file)

@@ -0,0 +1,5 @@
https://www.instagram.com/licmuunisul/
https://www.instagram.com/ligamiunisulpb/
https://www.instagram.com/lipracunisul/
https://www.instagram.com/liaphunisul/
https://www.instagram.com/lipali.unisul/

pyproject.toml (new file)

@@ -0,0 +1,8 @@
[project]
name = "instagram-scraper"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"playwright>=1.40.0",
]