I've been doing some web scraping. I wanted to run it in parallel, which meant adding proxies so I wouldn't hit the same site with too many requests from a single IP. Beyond the scraping logic itself, I found I needed a managed proxy pool (since proxies go bad) and a managed cache. I decided to implement all of this as a gRPC service hosted on my local network.
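As a rough sketch of the shape this takes (the host, service path, and message format here are hypothetical, not the actual private API), each parallel worker makes one call and the service handles proxy selection and caching behind it:

```python
import grpc

# Hypothetical endpoint; the real service runs on my local network.
channel = grpc.insecure_channel("scraper-host.local:50051")

# unary_unary with no (de)serializers sends and receives raw bytes,
# so this sketch works without generated protobuf stubs.
fetch = channel.unary_unary("/scraper.Scraper/Fetch")

# The service picks a healthy proxy and consults its cache internally;
# callers only see URL in, page bytes out.
page = fetch(b"https://example.com/listing")
```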

I also implemented monitoring and logging to help keep everything healthy, since jobs are scheduled to run every day.

There were plenty of design decisions along the way, so I kept a devlog, reproduced here. The code remains private for security reasons.

Webscraper 1

Requirements, basic design decisions, and technology choices.

2024

Webscraper 2

Create first gRPC service; will just echo for now (minimal sketch below).

2024
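A minimal echo server in Python might look roughly like this (names are placeholders; the sketch uses grpcio's generic handlers so no .proto compilation is needed):

```python
from concurrent import futures

import grpc

def echo(request, context):
    # Echo the raw request bytes straight back to the caller.
    return request

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
# A generic handler with no (de)serializers keeps messages as raw bytes.
server.add_generic_rpc_handlers((
    grpc.method_handlers_generic_handler(
        "scraper.Echo",  # hypothetical service name
        {"Echo": grpc.unary_unary_rpc_method_handler(echo)},
    ),
))
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```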

Webscraper 3

Start logging and reporting.

2024

Webscraper 4

Create Linux service for scraper. Shop for DBs. Set up Prometheus (instrumentation sketch below).

2024
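On the Prometheus side, instrumenting the service can be as small as exposing a metrics endpoint from the process; a sketch with prometheus_client (the metric names are made up):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for the scraper service.
REQUESTS = Counter("scraper_requests_total", "Scrape RPCs handled")
LATENCY = Histogram("scraper_fetch_seconds", "Time spent fetching a page")

start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics

def handle_fetch(url):
    REQUESTS.inc()
    with LATENCY.time():
        ...  # fetch via proxy, consult cache, and so on
```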

Webscraper 5

Change the DB flow to write-then-read: write to the cache, then read back from it before returning the response (sketch below). Later, a separate worker will handle the write.

2024
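The write-then-read idea, roughly (a hypothetical sketch, not the real code): every response is served from the cache, so the read path stays identical once a separate worker takes over the write step:

```python
cache: dict[str, bytes] = {}  # stand-in for the real cache

def download_via_proxy(url: str) -> bytes:
    ...  # stand-in for the real proxied fetch

def fetch(url: str) -> bytes:
    # Write first: store the freshly fetched page in the cache.
    cache[url] = download_via_proxy(url)
    # Then read: answer from the cache, never from the in-flight
    # download. When a different worker owns the write later on,
    # this read path does not change.
    return cache[url]
```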

Webscraper 6

Pause implementation to consider different API patterns.

2024

Webscraper 7

After a break, added some extra documentation and tweaked some existing commands.

2024

Webscraper 8

Created backend worker. Played more with logging, trying fluent-bit. Switched from LevelDB to SQLite to fix parallel access (sketch below).

2024
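The parallel-access fix comes down to SQLite's WAL mode, which lets readers proceed while a single writer commits; LevelDB, by contrast, locks the database to one process at a time. A sketch:

```python
import sqlite3

# WAL mode allows concurrent readers alongside one writer, and the
# timeout makes writers wait for the lock instead of failing fast.
conn = sqlite3.connect("cache.db", timeout=30)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)"
)
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?)",
        ("https://example.com", b"<html>...</html>"),
    )
```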

Webscraper 9

Implemented proxy pools. Switched from an expiration to a freshness pattern (sketch below).

2024
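The freshness pattern, as a hedged sketch with made-up names: rather than deleting entries when a fixed expiration passes, store when each entry was last fetched (or, for a proxy, last verified) and let each caller decide how stale is acceptable:

```python
import time

# Hypothetical shape: URL -> (fetched_at, body).
cache: dict[str, tuple[float, bytes]] = {}

def get_fresh(url: str, max_age: float) -> bytes | None:
    # Nothing is deleted at write time; each caller supplies its own
    # freshness requirement instead of the writer fixing an expiry.
    entry = cache.get(url)
    if entry is None:
        return None
    fetched_at, body = entry
    return body if time.time() - fetched_at <= max_age else None
```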

Webscraper 10

Migrate everything over to Docker. Use an ELK stack.

2024

Webscraper 11

Switched from SQLite to a file directory. Added compression (sketch below).

2024
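A sketch of the file-directory-plus-compression idea (the layout and names are my guesses): key each page by a hash of its URL, fan files out into subdirectories, and gzip the bodies:

```python
import gzip
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical location

def _path_for(url: str) -> Path:
    h = hashlib.sha256(url.encode()).hexdigest()
    # A two-character fan-out keeps any single directory small.
    return CACHE_DIR / h[:2] / f"{h}.gz"

def put(url: str, body: bytes) -> None:
    p = _path_for(url)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(gzip.compress(body))

def get(url: str) -> bytes | None:
    p = _path_for(url)
    return gzip.decompress(p.read_bytes()) if p.exists() else None
```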