Building a SERP Monitoring Pipeline That Actually Scales

The Naive Approach Breaks at 10K

A simple cron + worker pipeline works fine until you cross ~10K keywords/day. Past that, you need request batching, regional partitioning, and an exit-node scheduler that knows about Google's per-IP limits.

The Architecture

Producer (keyword scheduler) → Kafka → Worker pool with proxy-pool affinity → Parser → BigQuery sink. Idempotent retries on 429/503, dead-letter queue for permanent fails, and a metrics layer to spot regressions early.

Cost Reality

At 100M queries/month, proxy cost dominates. Hybrid routing (datacenter first → residential on retry) cut our blended cost-per-query by 38% versus residential-only.