chore: init project with playwright dependency
6
.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
browser_profile/
output.csv
output.md
__pycache__/
.venv/
*.pyc
525
docs/superpowers/plans/2026-03-29-instagram-scraper.md
Normal file
@@ -0,0 +1,525 @@
# Instagram Scraper Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Scrape the last 5 posts from each Instagram profile in `profiles.txt` and save combined output as `output.csv` and `output.md`.

**Architecture:** Single `scraper.py` script using Playwright sync API with a persistent Chromium profile (`./browser_profile/`). Pure parsing functions are unit-tested; browser interaction is manually tested end-to-end. Auth state persists between runs.

**Tech Stack:** Python 3.11+, Playwright (sync), uv for package management, stdlib csv/re for output.

---

## File Structure

| File | Responsibility |
|---|---|
| `pyproject.toml` | Project metadata and dependencies |
| `.gitignore` | Exclude browser_profile/, output files |
| `scraper.py` | All logic: auth check, profile reading, scraping, output writing |
| `tests/test_parsers.py` | Unit tests for pure parsing functions |

---

### Task 1: Project Setup

**Files:**
- Create: `pyproject.toml`
- Create: `.gitignore`

- [ ] **Step 1: Create pyproject.toml**

```toml
[project]
name = "instagram-scraper"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "playwright>=1.40.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```

- [ ] **Step 2: Create .gitignore**

```
browser_profile/
output.csv
output.md
__pycache__/
.venv/
*.pyc
```

- [ ] **Step 3: Install dependencies and Playwright browsers**

```bash
uv sync
uv run playwright install chromium
```

Expected: Chromium browser downloaded successfully.

- [ ] **Step 4: Commit**

```bash
git init
git add pyproject.toml .gitignore profiles.txt
git commit -m "chore: init project with playwright dependency"
```

---

### Task 2: Pure Parser Functions + Tests

**Files:**
- Create: `scraper.py` (parser functions only)
- Create: `tests/test_parsers.py`

- [ ] **Step 1: Write failing tests**

Create `tests/__init__.py` (empty), then `tests/test_parsers.py`:

```python
import sys, os

sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))

from scraper import extract_hashtags, extract_mentions, profile_slug_from_url


def test_extract_hashtags_basic():
    assert extract_hashtags("Hello #world #foo") == ["#world", "#foo"]


def test_extract_hashtags_empty():
    assert extract_hashtags("No tags here") == []


def test_extract_hashtags_deduplicates():
    assert extract_hashtags("#foo #foo #bar") == ["#foo", "#bar"]


def test_extract_mentions_basic():
    assert extract_mentions("Hey @alice and @bob") == ["@alice", "@bob"]


def test_extract_mentions_empty():
    assert extract_mentions("No mentions") == []


def test_profile_slug_from_url():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul/") == "licmuunisul"


def test_profile_slug_no_trailing_slash():
    assert profile_slug_from_url("https://www.instagram.com/licmuunisul") == "licmuunisul"
```

- [ ] **Step 2: Run tests to verify they fail**

```bash
uv run pytest tests/test_parsers.py -v
```

Expected: `ImportError` or `ModuleNotFoundError` (scraper.py doesn't exist yet).

- [ ] **Step 3: Implement parser functions in scraper.py**

Create `scraper.py`:

```python
import re
import csv
import sys
from pathlib import Path
from playwright.sync_api import sync_playwright, Page

PROFILES_FILE = Path("profiles.txt")
OUTPUT_CSV = Path("output.csv")
OUTPUT_MD = Path("output.md")
BROWSER_PROFILE = Path("browser_profile")
POSTS_PER_PROFILE = 5


def extract_hashtags(text: str) -> list[str]:
    seen = set()
    result = []
    for tag in re.findall(r"#\w+", text):
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result


def extract_mentions(text: str) -> list[str]:
    seen = set()
    result = []
    for mention in re.findall(r"@\w+", text):
        if mention not in seen:
            seen.add(mention)
            result.append(mention)
    return result


def profile_slug_from_url(url: str) -> str:
    return url.rstrip("/").split("/")[-1]


def read_profiles() -> list[str]:
    urls = []
    for line in PROFILES_FILE.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        # Lines may be prefixed with a number and tab
        parts = line.split("\t")
        url = parts[-1].strip()
        if url.startswith("http"):
            urls.append(url)
    return urls
```
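One caveat, noted as an observation rather than part of the plan: `profile_slug_from_url` is purely string-based, so a query string would survive into the slug. A standalone check (function body copied from the block above):

```python
def profile_slug_from_url(url: str) -> str:
    return url.rstrip("/").split("/")[-1]

# Clean profile URLs yield the username...
assert profile_slug_from_url("https://www.instagram.com/licmuunisul/") == "licmuunisul"
# ...but a query string leaks into the result, so keep profiles.txt entries bare.
print(profile_slug_from_url("https://www.instagram.com/licmuunisul/?hl=en"))  # ?hl=en
```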

- [ ] **Step 4: Run tests to verify they pass**

```bash
uv run pytest tests/test_parsers.py -v
```

Expected: all 7 tests PASS.

- [ ] **Step 5: Commit**

```bash
git add scraper.py tests/
git commit -m "feat: add parser functions with tests"
```

---

### Task 3: Auth Check

**Files:**
- Modify: `scraper.py` (add `is_logged_in`, `ensure_authenticated`)

- [ ] **Step 1: Add auth functions to scraper.py**

Append after the `read_profiles` function:

```python
def is_logged_in(page: Page) -> bool:
    page.goto("https://www.instagram.com/", wait_until="networkidle", timeout=30000)
    # Logged in: shows feed or home icon. Not logged in: shows login form.
    return page.locator("input[name='username']").count() == 0


def ensure_authenticated(page: Page) -> None:
    if is_logged_in(page):
        print("[auth] Session active, proceeding.")
        return
    print("[auth] Not logged in. Please log in to Instagram in the browser window.")
    print("[auth] Press Enter here when you are logged in and can see your feed...")
    input()
    # Verify login succeeded
    if not is_logged_in(page):
        print("[auth] Still not logged in. Please try again and restart the script.")
        sys.exit(1)
    print("[auth] Login confirmed.")
```

- [ ] **Step 2: Commit**

```bash
git add scraper.py
git commit -m "feat: add auth check and manual login wait"
```

---

### Task 4: Post URL Collection from Profile Grid

**Files:**
- Modify: `scraper.py` (add `get_post_urls`)

- [ ] **Step 1: Add post URL collector**

Append after `ensure_authenticated`:

```python
def get_post_urls(page: Page, profile_url: str, count: int = POSTS_PER_PROFILE) -> list[str]:
    slug = profile_slug_from_url(profile_url)
    print(f"[{slug}] Navigating to profile...")
    try:
        page.goto(profile_url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        print(f"[{slug}] Failed to load profile: {e}")
        return []

    # Wait for posts grid
    try:
        page.wait_for_selector("article a[href*='/p/']", timeout=15000)
    except Exception:
        print(f"[{slug}] No posts found or profile is private.")
        return []

    links = page.locator("article a[href*='/p/']").all()
    seen = set()
    urls = []
    for link in links:
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            urls.append("https://www.instagram.com" + href)
            if len(urls) >= count:
                break

    print(f"[{slug}] Found {len(urls)} post URLs.")
    return urls
```

- [ ] **Step 2: Commit**

```bash
git add scraper.py
git commit -m "feat: collect post URLs from profile grid"
```

---

### Task 5: Individual Post Scraper

**Files:**
- Modify: `scraper.py` (add `scrape_post`)

- [ ] **Step 1: Add post scraper**

Append after `get_post_urls`:

```python
def scrape_post(page: Page, post_url: str, profile_slug: str) -> dict:
    print(f"  Scraping {post_url}")
    result = {
        "profile": profile_slug,
        "post_url": post_url,
        "date": "",
        "caption": "",
        "likes": "",
        "image_urls": "",
        "hashtags": "",
        "mentions": "",
        "location": "",
        "media_type": "",
    }

    try:
        page.goto(post_url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        print(f"  Failed to load post: {e}")
        return result

    # Date
    time_el = page.locator("time[datetime]").first
    if time_el.count():
        result["date"] = time_el.get_attribute("datetime") or ""

    # Caption — expand "more" if present
    more_btn = page.locator("span[role='button']").filter(has_text=re.compile(r"more", re.I))
    if more_btn.count():
        try:
            more_btn.first.click()
            page.wait_for_timeout(500)
        except Exception:
            pass
    caption_el = page.locator("article h1, article div[data-testid='post-comment-root'] span").first
    if caption_el.count():
        result["caption"] = caption_el.inner_text().strip()

    # Likes
    likes_el = page.locator("section span:has-text('like'), section a:has-text('like')").first
    if likes_el.count():
        result["likes"] = likes_el.inner_text().strip()
    else:
        # Fallback: aria-label on like button section
        like_section = page.locator("section._ae2s, section[class*='like']").first
        if like_section.count():
            result["likes"] = like_section.inner_text().strip()

    # Media type + image URLs
    carousel = page.locator("div[data-testid='media-number-indicator'], button[aria-label*='Next']")
    video = page.locator("video")
    if carousel.count():
        result["media_type"] = "carousel"
    elif video.count():
        result["media_type"] = "video"
    else:
        result["media_type"] = "photo"

    imgs = page.locator("article img[src]").all()
    img_urls = [img.get_attribute("src") for img in imgs if img.get_attribute("src")]
    result["image_urls"] = ", ".join(img_urls)

    # Hashtags and mentions from caption
    result["hashtags"] = ", ".join(extract_hashtags(result["caption"]))
    result["mentions"] = ", ".join(extract_mentions(result["caption"]))

    # Location
    loc_el = page.locator("a[href*='/explore/locations/']").first
    if loc_el.count():
        result["location"] = loc_el.inner_text().strip()

    return result
```

- [ ] **Step 2: Commit**

```bash
git add scraper.py
git commit -m "feat: implement individual post scraper"
```

---

### Task 6: Output Writers

**Files:**
- Modify: `scraper.py` (add `write_csv`, `write_markdown`)

- [ ] **Step 1: Add output writers**

Append after `scrape_post`:

```python
FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls",
          "hashtags", "mentions", "location", "media_type"]


def write_csv(posts: list[dict]) -> None:
    with OUTPUT_CSV.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(posts)
    print(f"[output] CSV saved to {OUTPUT_CSV} ({len(posts)} posts)")


def write_markdown(posts: list[dict]) -> None:
    from itertools import groupby

    with OUTPUT_MD.open("w", encoding="utf-8") as f:
        f.write("# Instagram Scrape Results\n\n")
        for profile, group in groupby(posts, key=lambda p: p["profile"]):
            f.write(f"## {profile}\n\n")
            for post in group:
                f.write(f"### [{post['post_url']}]({post['post_url']})\n\n")
                for field in FIELDS:
                    if field in ("profile", "post_url"):
                        continue
                    value = post.get(field, "")
                    if value:
                        f.write(f"- **{field}:** {value}\n")
                f.write("\n")
    print(f"[output] Markdown saved to {OUTPUT_MD}")
```

- [ ] **Step 2: Commit**

```bash
git add scraper.py
git commit -m "feat: add CSV and markdown output writers"
```

---

### Task 7: Main Entry Point

**Files:**
- Modify: `scraper.py` (add `main` function and `__main__` block)

- [ ] **Step 1: Add main function**

Append at the end of `scraper.py`:

```python
def main():
    profiles = read_profiles()
    if not profiles:
        print("No profiles found in profiles.txt")
        return

    print(f"[main] Loaded {len(profiles)} profiles.")
    all_posts = []

    BROWSER_PROFILE.mkdir(exist_ok=True)

    with sync_playwright() as p:
        browser = p.chromium.launch_persistent_context(
            user_data_dir=str(BROWSER_PROFILE),
            headless=False,
            viewport={"width": 1280, "height": 900},
        )
        page = browser.new_page()

        ensure_authenticated(page)

        try:
            for profile_url in profiles:
                slug = profile_slug_from_url(profile_url)
                post_urls = get_post_urls(page, profile_url)
                for post_url in post_urls:
                    post = scrape_post(page, post_url, slug)
                    all_posts.append(post)
        except KeyboardInterrupt:
            print("\n[main] Interrupted. Saving collected data...")

        browser.close()

    if all_posts:
        write_csv(all_posts)
        write_markdown(all_posts)
    else:
        print("[main] No posts collected.")


if __name__ == "__main__":
    main()
```

- [ ] **Step 2: Run the parser tests one final time to confirm nothing broke**

```bash
uv run pytest tests/test_parsers.py -v
```

Expected: all 7 tests PASS.

- [ ] **Step 3: Commit**

```bash
git add scraper.py
git commit -m "feat: wire main entry point, complete scraper"
```

---

### Task 8: End-to-End Test

- [ ] **Step 1: Run the scraper**

```bash
uv run python scraper.py
```

Expected:
- Chromium opens
- If not logged in: prompted to log in, then press Enter
- Script visits each of the 5 profiles, collects 5 post URLs each
- Visits each post (up to 25 total) and scrapes data
- Writes `output.csv` and `output.md`

- [ ] **Step 2: Verify CSV output**

```bash
head -3 output.csv
```

Expected: header row + at least 2 data rows with profile, post_url, date populated.

- [ ] **Step 3: Verify Markdown output**

```bash
head -30 output.md
```

Expected: `# Instagram Scrape Results` heading, profile sections, post subsections with bullet fields.
@@ -0,0 +1,72 @@
# Instagram Scraper — Design Spec

**Date:** 2026-03-29

## Overview

A single Python script (`scraper.py`) that uses Playwright with a persistent Chromium browser profile to scrape the last 5 posts from each Instagram profile listed in `profiles.txt`. Output is a combined `output.csv` and `output.md`.

## Architecture

Single script, no external services. Persistent browser profile (`./browser_profile/`) stores cookies/session so authentication only needs to happen once.

## Flow

1. Launch Chromium with `user_data_dir=./browser_profile/` (non-headless)
2. Check if already authenticated (detect Instagram home feed); if not, pause and wait for the user to log in manually, then press Enter to continue
3. Read `profiles.txt`, extract profile URLs (skip blank lines and line numbers)
4. For each profile:
   - Navigate to the profile page
   - Wait for the posts grid to load
   - Collect the first 5 post links from the grid
   - Visit each post page
   - Scrape all data fields (see below)
5. Write all scraped posts to `output.csv` and `output.md`

## Data Fields

| Field | Description |
|---|---|
| `profile` | Profile username (from profiles.txt) |
| `post_url` | Full URL of the post |
| `date` | ISO datetime from `<time>` element's `datetime` attribute |
| `caption` | Full caption text (expand "more" button if present) |
| `likes` | Like count (integer or raw string if not parseable) |
| `image_urls` | Comma-separated list of image/video src URLs in the post |
| `hashtags` | Hashtags extracted from caption (comma-separated) |
| `mentions` | @mentions extracted from caption (comma-separated) |
| `location` | Location tag text if present, else empty |
| `media_type` | One of: `photo`, `video`, `carousel` |

## Output Format

**CSV (`output.csv`):** One row per post, columns matching the data fields above.
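A minimal sketch of the resulting header row, reusing the field order of the plan's `FIELDS` constant (the sample row values are hypothetical):

```python
import csv
import io

# Same field order as the FIELDS constant in scraper.py
FIELDS = ["profile", "post_url", "date", "caption", "likes", "image_urls",
          "hashtags", "mentions", "location", "media_type"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
# Hypothetical row; real rows come from scrape_post(). Missing keys become empty cells.
writer.writerow({"profile": "licmuunisul", "post_url": "https://www.instagram.com/p/XYZ/"})
print(buf.getvalue().splitlines()[0])
```

Every row carries all ten columns, so empty fields keep the columns aligned.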

**Markdown (`output.md`):** Grouped by profile (second-level headings). Each post is a third-level section with its fields as a bullet list.
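For orientation, a sketch of the layout this produces (post URL and field values are hypothetical):

```
# Instagram Scrape Results

## licmuunisul

### [https://www.instagram.com/p/XYZ/](https://www.instagram.com/p/XYZ/)

- **date:** 2026-03-28T15:04:05.000Z
- **caption:** Example caption #tag @mention
- **hashtags:** #tag
- **mentions:** @mention
- **media_type:** photo
```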

## Error Handling

- If a profile page fails to load (404, private, etc.): log a warning, skip that profile, continue
- If a post page fails to load: log a warning, skip that post
- If a field cannot be found: store an empty string (no crash)
- On keyboard interrupt: flush any collected data to the output files before exiting

## Dependencies

- `playwright` (Python) — browser automation
- Standard library only for CSV/Markdown writing and regex-based hashtag/mention extraction

## Files

```
instagram/
├── profiles.txt        # input: list of profile URLs
├── scraper.py          # main script
├── browser_profile/    # persistent Chromium session (gitignored)
├── output.csv          # combined output
└── output.md           # combined output
```

## Session Persistence

The `./browser_profile/` directory holds the Chromium user data. Once authenticated, subsequent runs reuse the session without prompting. If Instagram logs the session out, the script detects this and prompts for re-authentication.
5
profiles.txt
Normal file
@@ -0,0 +1,5 @@
https://www.instagram.com/licmuunisul/
https://www.instagram.com/ligamiunisulpb/
https://www.instagram.com/lipracunisul/
https://www.instagram.com/liaphunisul/
https://www.instagram.com/lipali.unisul/
8
pyproject.toml
Normal file
@@ -0,0 +1,8 @@
[project]
name = "instagram-scraper"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "playwright>=1.40.0",
]