studyeventz

🎓 Australian University Agent Scraper & Database

A complete system to scrape, store, query, and report on publicly listed education agents from Australian university websites.


Setup

cd agent_scraper
pip install -r requirements.txt

1. Scraping (scrape.py)

First run: scrape all universities

python scrape.py

Check status without scraping

python scrape.py --list

Scrape a specific university (partial name match)

python scrape.py --uni "Monash"
python scrape.py --uni "Melbourne"
python scrape.py --uni "Queensland"

Re-scrape everything (refresh existing data)

python scrape.py --refresh

Just load the university list into the DB (no scraping)

python scrape.py --load-only

How the scraper works: for each page, the scraper tries the following strategies in order:

  1. HTML tables: most structured pages
  2. Repeated card/list elements: modern styled pages
  3. Definition lists: some older university sites
  4. Email-anchored text blocks: pages with inline contact info
  5. Country/region headings + entries: common "agents by country" layout
  6. Raw text fallback: saves page preview for manual review
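
The ordered strategies above can be sketched as a simple fallback chain. This is only an illustration, not the code in scrape.py: it assumes the scraper stops at the first strategy that yields records, it implements just two of the six strategies using the standard library, and every function name and status label other than 'raw_text_fallback' is hypothetical.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class _TableRows(HTMLParser):
    """Collects the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):
                self.rows.append(self._row)
            self._row, self._cell = None, None


def from_tables(html):
    """Strategy 1 (hypothetical helper): one record per table row."""
    parser = _TableRows()
    parser.feed(html)
    # Assumption: the first row of the table is a header and is skipped.
    return [{"name": row[0], "raw_text": " | ".join(row)} for row in parser.rows[1:]]


def from_email_blocks(html):
    """Strategy 4 (hypothetical helper): text blocks that contain an email."""
    text = re.sub(r"<[^>]+>", "\n", html)  # crude tag stripping
    records = []
    for block in re.split(r"\n\s*\n", text):
        match = EMAIL_RE.search(block)
        if match:
            records.append({
                "name": block.strip().splitlines()[0].strip(),
                "email": match.group(0),
                "raw_text": " ".join(block.split()),
            })
    return records


def extract_agents(html):
    """Try each strategy in order; fall back to a raw text preview."""
    strategies = [("table", from_tables), ("email_block", from_email_blocks)]
    for status, strategy in strategies:
        records = strategy(html)
        if records:
            return status, records
    preview = " ".join(re.sub(r"<[^>]+>", " ", html).split())[:500]
    return "raw_text_fallback", [{"raw_text": preview}]


status, records = extract_agents("<div>Loading agent finder...</div>")
print(status, records)
```

Keeping each strategy as a function that returns a (possibly empty) list makes it easy to add, remove, or reorder strategies without touching the loop.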

2. Querying (query.py)

Show all agents

python query.py agents

Filter by country

python query.py agents --country "China"
python query.py agents --country "India"
python query.py agents --country "Indonesia"

Filter by university

python query.py agents --university "Monash"

Find agents with email addresses

python query.py agents --has-email

Search by email domain

python query.py agents --email "@gmail.com"
python query.py agents --email "education"

Full-text search (searches name, email, raw text)

python query.py search "IDP"
python query.py search "education group"
python query.py search "Beijing"
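
A search across name, email, and raw text could be approximated with a LIKE query, as in the rough sketch below. The column names (company, email, raw_text) and the search_agents helper are assumptions for illustration; query.py may use different columns or SQLite's full-text search instead.

```python
import sqlite3


def search_agents(conn, term):
    """Return (company, email) rows where any searched column contains term."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT company, email FROM agents "
        "WHERE company LIKE ? OR email LIKE ? OR raw_text LIKE ? "
        "ORDER BY company",
        (pattern, pattern, pattern),
    ).fetchall()


# Throwaway in-memory data, invented purely for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (company TEXT, email TEXT, raw_text TEXT)")
conn.executemany(
    "INSERT INTO agents VALUES (?, ?, ?)",
    [
        ("IDP Education", "info@idp.example", "IDP Education, Beijing office"),
        ("Acme Study Group", None, "Acme Study Group, Hanoi"),
    ],
)

print(search_agents(conn, "Beijing"))
```

SQLite's LIKE is case-insensitive for ASCII letters by default, so searching for "idp" would also match "IDP Education" here.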

Statistics

python query.py stats                    # By country (default)
python query.py stats --by university    # By university
python query.py stats --by country_university  # Cross-tab

Coverage overview

python query.py coverage

Export to Excel

python query.py agents --country "Vietnam" --export vietnam_agents.xlsx
python query.py stats --by country --export country_stats.xlsx
python query.py coverage --export coverage.xlsx

3. Social Media Reports (social_report.py)

Full report (HTML + Excel)

python social_report.py

Outputs to ./reports/ directory.

Country-specific report

python social_report.py --country "China"
python social_report.py --country "India"

University-specific report

python social_report.py --university "Monash"

Format options

python social_report.py --format html      # HTML only
python social_report.py --format excel     # Excel only
python social_report.py --format both      # Both (default)

Custom output directory

python social_report.py --output ./my_reports/

Report contents:


File Structure

agent_scraper/
├── scrape.py          - Main scraper
├── query.py           - Database query CLI
├── social_report.py   - Report generator
├── requirements.txt   - Python dependencies
├── data/
│   ├── australian_university_agent_pages.xlsx  - Source URL list
│   └── agents.db      - SQLite database (created on first run)
└── reports/           - Generated reports (created automatically)

Database Schema

universities: 42 Australian universities with their agent page URLs
agents: Individual agent records (company, contact, country, email, phone, website)
scrape_log: History of all scrape attempts
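
For orientation, the three tables might look roughly like the sketch below. Only the table names and the agents fields (company, contact, country, email, phone, website) come from the description above; every other column, the keys, and the init_db helper are assumptions, so check agents.db itself for the real definitions.

```python
import sqlite3

# Hypothetical DDL: only the table names and the listed agents fields are from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS universities (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,
    agent_url     TEXT,              -- agent page URL (column name assumed)
    scrape_status TEXT               -- e.g. 'raw_text_fallback' (placement assumed)
);
CREATE TABLE IF NOT EXISTS agents (
    id            INTEGER PRIMARY KEY,
    university_id INTEGER NOT NULL REFERENCES universities(id),
    company       TEXT,
    contact       TEXT,
    country       TEXT,
    email         TEXT,
    phone         TEXT,
    website       TEXT,
    raw_text      TEXT               -- assumed, to support text search
);
CREATE TABLE IF NOT EXISTS scrape_log (
    id            INTEGER PRIMARY KEY,
    university_id INTEGER NOT NULL REFERENCES universities(id),
    scraped_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    status        TEXT,
    agents_found  INTEGER
);
"""


def init_db(path=":memory:"):
    """Create the tables if they do not exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```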


Notes on Coverage

Some university pages use JavaScript-rendered agent finders (e.g. Macquarie, some QLD universities). This tool can't scrape them, because the agent list only appears after a browser runs the page's scripts. Such pages are recorded with scrape_status = 'raw_text_fallback'. For those, visit the page manually and copy/paste the agent list, or use a Selenium-based scraper for those specific URLs.
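
To find the pages that need this manual follow-up, the database can be queried directly for the fallback status. The sketch below assumes the status is stored in a scrape_status column on the universities table alongside name and agent_url columns; those names, the helper, and the sample rows are all hypothetical.

```python
import sqlite3


def pages_needing_manual_review(conn):
    """List (name, agent_url) for universities that fell back to raw text."""
    return conn.execute(
        "SELECT name, agent_url FROM universities "
        "WHERE scrape_status = ? ORDER BY name",
        ("raw_text_fallback",),
    ).fetchall()


# Placeholder rows for the demonstration; not real universities or URLs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE universities (name TEXT, agent_url TEXT, scrape_status TEXT)"
)
conn.executemany(
    "INSERT INTO universities VALUES (?, ?, ?)",
    [
        ("Example University A", "https://a.example/agents", "ok"),
        ("Example University B", "https://b.example/agents", "raw_text_fallback"),
    ],
)

for name, url in pages_needing_manual_review(conn):
    print(f"{name}: {url}")
```

Against the real database, the same query could be run by connecting to data/agents.db instead of the in-memory database, once the actual column names are confirmed.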

Pages confirmed to have static HTML agent lists (best scrape results expected):