# Instagram Profile Scraper

Scrapes the last N posts (default 5, set via `POSTS_PER_PROFILE`) from a list of Instagram profiles and saves the results as CSV and Markdown. Uses Playwright with a real authenticated browser session — no API keys required.

## What it does

1. Reads a list of Instagram profile URLs from `profiles.txt`
2. Opens a Chromium browser (visible, non-headless)
3. On first run: prompts you to log in manually, then saves the session to `auth_state.json`
4. On subsequent runs: reuses the saved session automatically
5. Visits each profile and collects the last 5 post URLs (configurable via `POSTS_PER_PROFILE`)
6. Visits each post and extracts: date, caption, likes, image URLs, hashtags, mentions, location, media type
7. Writes combined results to `output.csv` and `output.md`
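
The hashtag and mention extraction in step 6 comes down to simple regexes over the caption text. A minimal sketch (the function names are illustrative, not necessarily those in `scraper.py`):

```python
import re

def extract_hashtags(caption: str) -> list[str]:
    # Hashtags: '#' followed by word characters (letters, digits, underscore)
    return re.findall(r"#(\w+)", caption)

def extract_mentions(caption: str) -> list[str]:
    # Mentions: '@' followed by Instagram-style username characters (word chars and dots)
    return re.findall(r"@([\w.]+)", caption)
```

The comma-separated `hashtags` and `mentions` CSV columns would then just be `",".join(...)` over these lists.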
## Setup

Requires Python 3.11+ and [uv](https://docs.astral.sh/uv/).

```bash
git clone <repo-url>
cd instagram-scraper
uv sync
uv run playwright install chromium
```

## Configuration

**`profiles.txt`** — one Instagram profile URL per line. Lines can optionally be prefixed with a number and tab (the scraper strips them):

```
https://www.instagram.com/username1/
https://www.instagram.com/username2/
https://www.instagram.com/username3/
```
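
The prefix stripping described above might look like the following (a sketch; `load_profiles` is a hypothetical name, and the real parser in `scraper.py` may differ):

```python
import re

def load_profiles(text: str) -> list[str]:
    """Return profile URLs, dropping optional '<number><TAB>' prefixes and blank lines."""
    urls = []
    for line in text.splitlines():
        # Strip an optional leading number-and-tab prefix, then surrounding whitespace
        cleaned = re.sub(r"^\d+\t", "", line).strip()
        if cleaned:
            urls.append(cleaned)
    return urls
```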
**Constants in `scraper.py`** (edit directly):

| Constant | Default | Description |
|---|---|---|
| `POSTS_PER_PROFILE` | `5` | How many posts to scrape per profile |
| `PROFILES_FILE` | `profiles.txt` | Input file path |
| `OUTPUT_CSV` | `output.csv` | CSV output path |
| `OUTPUT_MD` | `output.md` | Markdown output path |

## Usage

```bash
uv run python scraper.py
```

On first run, a Chromium window opens. Log in to Instagram, then press Enter in the terminal. The session is saved to `auth_state.json` and reused on future runs.

If Instagram logs you out, delete `auth_state.json` and run again.
## Output

### CSV (`output.csv`)

One row per post with these columns:

| Column | Description |
|---|---|
| `profile` | Instagram username |
| `post_url` | Full URL of the post |
| `date` | ISO 8601 datetime (e.g. `2026-02-23T15:28:13.000Z`) |
| `caption` | Full post caption text |
| `likes` | Like count (as displayed) |
| `image_urls` | Comma-separated CDN image URLs |
| `hashtags` | Comma-separated hashtags from caption |
| `mentions` | Comma-separated @mentions from caption |
| `location` | Location tag text (empty if none) |
| `media_type` | `photo`, `video`, or `carousel` |
### Markdown (`output.md`)

Same data, grouped by profile. Each post is a section with all fields as a bullet list.
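
Since rows already arrive ordered by profile, the grouping falls out of `itertools.groupby`. A minimal sketch (heading levels and field order are assumptions about the actual `output.md` layout):

```python
from itertools import groupby

def render_markdown(rows):
    # Rows are ordered by profile, so groupby collects each profile's adjacent posts
    lines = []
    for profile, posts in groupby(rows, key=lambda r: r["profile"]):
        lines.append(f"## {profile}")
        for post in posts:
            lines.append(f"### {post['post_url']}")
            for field, value in post.items():
                if field not in ("profile", "post_url"):
                    lines.append(f"- **{field}**: {value}")
            lines.append("")
    return "\n".join(lines)
```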
## Error handling

- Private or unavailable profiles: skipped with a warning, scraping continues
- Individual post failures: skipped with a warning, scraping continues
- Missing fields: stored as empty string, no crash
- Keyboard interrupt (Ctrl+C): saves whatever has been collected so far
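
That skip-and-continue policy boils down to a driver loop shaped roughly like this (a sketch; `scrape_profile` and `save_results` are hypothetical stand-ins for the real functions in `scraper.py`):

```python
def scrape_all(profile_urls, scrape_profile, save_results):
    results = []
    try:
        for url in profile_urls:
            try:
                results.extend(scrape_profile(url))
            except Exception as exc:
                # Private/unavailable profile or failed post: warn and move on
                print(f"warning: skipping {url}: {exc}")
    except KeyboardInterrupt:
        # Ctrl+C: fall through and save whatever was collected so far
        print("interrupted, saving partial results")
    save_results(results)
    return results
```

The key design point is that `save_results` runs on every exit path, so a Ctrl+C mid-run still produces a usable `output.csv`.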
## Files

```
instagram-scraper/
├── scraper.py           # main script
├── profiles.txt         # input: list of profile URLs
├── pyproject.toml       # project metadata and dependencies
├── tests/
│   └── test_parsers.py  # unit tests for parsing functions
├── auth_state.json      # saved session (created on first run, gitignored)
├── output.csv           # results (gitignored)
└── output.md            # results (gitignored)
```

## Running tests

```bash
uv run pytest tests/ -v
```

## Notes

- Works with public profiles. Private profiles are skipped.
- Instagram rate-limits aggressive scraping. The script adds a 1.5s wait between post requests.
- Session cookies expire periodically. Delete `auth_state.json` to re-authenticate.
- Image URLs are CDN URLs that expire after some time — download them promptly if needed.