
I've seen people using Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing they miss in Airflow is monitoring long-running jobs like scraping: the number of pages and items scraped so far, and the number of URLs that failed so far or were retried without success.

What are my options for monitoring the current status of long-running jobs? Is there something already available, or do I need to resort to external solutions like Prometheus and Grafana and instrument my Scrapy spiders myself?
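For context, the "instrument it yourself" option I have in mind would look roughly like the sketch below: a small Scrapy extension that periodically copies the crawler's built-in stats into Prometheus gauges. This assumes prometheus_client is installed; the extension name, gauge names, and settings keys are made up, while the Scrapy signals, stats keys, and prometheus_client calls are standard.

    # Minimal sketch of instrumenting a spider for Prometheus (names are placeholders).
    from prometheus_client import Gauge, start_http_server
    from scrapy import signals
    from twisted.internet import task


    class PrometheusStatsExtension:
        """Periodically copies Scrapy's built-in crawl stats into Prometheus gauges."""

        def __init__(self, crawler, port, interval):
            self.crawler = crawler
            self.interval = interval
            self.loop = None
            start_http_server(port)  # exposes /metrics for Prometheus to scrape
            self.pages = Gauge("scrapy_pages_scraped", "Responses received so far", ["spider"])
            self.items = Gauge("scrapy_items_scraped", "Items scraped so far", ["spider"])
            self.retries = Gauge("scrapy_request_retries", "Requests retried so far", ["spider"])

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(
                crawler,
                port=crawler.settings.getint("PROMETHEUS_PORT", 9410),
                interval=crawler.settings.getfloat("PROMETHEUS_INTERVAL", 15.0),
            )
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            self.loop = task.LoopingCall(self._export, spider)
            self.loop.start(self.interval)

        def spider_closed(self, spider):
            if self.loop and self.loop.running:
                self.loop.stop()
            self._export(spider)  # push the final numbers once more

        def _export(self, spider):
            stats = self.crawler.stats
            self.pages.labels(spider.name).set(stats.get_value("response_received_count", 0))
            self.items.labels(spider.name).set(stats.get_value("item_scraped_count", 0))
            self.retries.labels(spider.name).set(stats.get_value("retry/count", 0))

That would be enabled through the EXTENSIONS setting and scraped by Prometheus, with Grafana on top. I'd still prefer something built into Airflow if it exists.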

dzieciou

1 Answer


We've had better luck keeping our Airflow jobs short and sweet.

With long-running tasks, you risk running into queue back-ups. And we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.

In a case like yours, where there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The Airflow DAGs pull messages off a Kafka topic and then report success/failure via a Kafka wrapper.

We end up with several overlapping Airflow tasks running in "micro-batches", each reading a tunable number of messages off Kafka, with the goal of keeping each task under a certain run time.

By keeping the work small in each Airflow task, we can easily scale the number of messages up or down to balance each task's run time against the overall parallelism of the Airflow workers.
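Roughly, one micro-batch task looks something like the sketch below. It assumes kafka-python and a recent Airflow 2.x with the TaskFlow API; the topic and broker names, BATCH_SIZE, and process_message() are placeholders for whatever your scraping pipeline actually does.

    # Sketch of the micro-batch pattern: each Airflow task drains a bounded slice of a topic.
    from datetime import datetime, timedelta

    from airflow.decorators import dag, task
    from kafka import KafkaConsumer

    BATCH_SIZE = 500          # tune this to keep each task comfortably under its timeout
    POLL_TIMEOUT_MS = 10_000  # give up early if the topic runs dry


    def process_message(value: bytes) -> None:
        """Placeholder: hand one unit of scraping work to Scrapyd, record the result, etc."""


    @dag(schedule=timedelta(minutes=5), start_date=datetime(2024, 1, 1), catchup=False)
    def scrape_microbatch():

        @task(execution_timeout=timedelta(minutes=10))
        def consume_batch() -> int:
            consumer = KafkaConsumer(
                "scrape-requests",
                bootstrap_servers="kafka:9092",
                group_id="airflow-scrapers",
                enable_auto_commit=False,   # only commit offsets after the batch succeeds
                max_poll_records=BATCH_SIZE,
            )
            try:
                batch = consumer.poll(timeout_ms=POLL_TIMEOUT_MS, max_records=BATCH_SIZE)
                handled = 0
                for records in batch.values():
                    for record in records:
                        process_message(record.value)
                        handled += 1
                consumer.commit()           # a failed task leaves offsets uncommitted for a retry
                return handled              # visible per task run in the Airflow UI / XCom
            finally:
                consumer.close()

        consume_batch()


    scrape_microbatch()

Because each run only commits offsets after its batch succeeds, a failed task simply re-reads the same messages on retry, and the Airflow UI shows progress batch by batch instead of one opaque multi-hour job.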

Seems like you could explore something similar here?

Hope that helps!