
I have multiple Scrapy spiders that I need to run at the same time, every 5 minutes. The issue is that they take 30 seconds to a minute to start.

It seems that they each start their own Twisted reactor, which takes a lot of time.

I've looked into different ways to run multiple spiders at the same time (see Running Multiple spiders in scrapy for 1 website in parallel?), but I need a log per spider and a process per spider to integrate well with Airflow.

I've looked into scrapyd, but it doesn't seem to share a Twisted reactor between multiple spiders. Is that correct?

Are there other ways I could achieve my goals?

fast_cen
    `they take almost 30 sec to 1 minute to start.` ??? :P are you running Scrapy on a potato? – Umair Ayub Feb 26 '18 at 09:58
  • I have around 10 spiders on a (4 vCPUs - Intel Sandy Bridge, 4.75 GB memory) machine. Is that too low? – fast_cen Feb 26 '18 at 10:37
  • Yup, I have never had a scraper taking so much time to start – Umair Ayub Feb 26 '18 at 10:54
  • There was some python module that was loading a big file, so I removed it, and now it's 10sec faster. However, it still takes 20/30sec from the moment of CMD to the first output of the script... So, is there any way to share a twisted engine for multiple spiders? – fast_cen Feb 26 '18 at 12:12
  • Note that when I launch only one spider, I take 5~6s to load. The 20~30 sec are when the 10 spiders start at the same time. – fast_cen Feb 26 '18 at 12:15
  • we did this by writing a gateway; the gateway commands scrapyd to run spiders, sending the requests concurrently – mirhossein Aug 18 '18 at 10:04
