
I have more than 100 spiders and I want to run 5 spiders at a time using a script. For this I have created a table in a database that tracks the status of each spider, i.e. whether it has finished running, is running, or is waiting to run.
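
For illustration, here is a minimal sketch of such a table (the table/column names and the use of SQLite are just examples; the real schema may differ):

# Sketch only: table and column names are illustrative, not the actual schema.
import sqlite3

conn = sqlite3.connect("spiders.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS spider_status (
        name   TEXT PRIMARY KEY,   -- spider name
        status TEXT NOT NULL       -- 'waiting', 'running' or 'finished'
    )
""")
conn.commit()
conn.close()
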
I know how to run multiple spiders inside a script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for a demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on which spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()

But this does not work, as the following error occurs:

Traceback (most recent call last):
File "test.py", line 24, in <module>
  process.start()
File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
  reactor.run(installSignalHandlers=False)  # blocking call
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
  self.startRunning(installSignalHandlers=installSignalHandlers)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
  ReactorBase.startRunning(self)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
  raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I have searched for the above error but was not able to resolve it. Managing spiders can be done via ScrapyD, but we do not want to use ScrapyD since many spiders are still in the development phase.

Any workaround for the above scenario is appreciated.

Thanks

Gaur93

3 Answers


For running multiple spiders simultaneously, you can use this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The answers to this question can help you too.

For more information:

Running multiple spiders in the same process

aboutaaron
parik
  • I have tried all this, but it does not work for me, as I want them in a loop, 5 at a time, not all at the same time. – Gaur93 Jan 31 '18 at 10:55
  • What do you mean by "does not work for me as i want them in a loop 5 at a time not all at the same time"? – parik Jan 31 '18 at 11:51
  • If you add spiders like above in process.crawl, then they will run concurrently. – Gaur93 Feb 01 '18 at 09:52

You need ScrapyD for this purpose.

You can run as many spiders as you want at the same time, and you can constantly check whether a spider is running or not using the listjobs API.

You can set max_proc=5 in the config file, which will run a maximum of 5 spiders at a single time.
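
As a rough sketch (assuming ScrapyD is running on the default localhost:6800 and your project is named myproject, both of which are assumptions), you can poll listjobs.json and queue new runs with schedule.json:

# Sketch only: assumes ScrapyD on the default localhost:6800 and a project
# named "myproject". listjobs.json and schedule.json are ScrapyD's standard
# JSON API endpoints; setting max_proc = 5 under [scrapyd] in scrapyd.conf
# caps concurrent crawler processes at five.
import requests

SCRAPYD_URL = "http://localhost:6800"
PROJECT = "myproject"  # hypothetical project name

def running_spider_count():
    # listjobs.json returns pending/running/finished job lists for a project
    resp = requests.get(SCRAPYD_URL + "/listjobs.json", params={"project": PROJECT})
    resp.raise_for_status()
    return len(resp.json().get("running", []))

def schedule_spider(spider_name):
    # schedule.json queues a spider run on the ScrapyD server
    resp = requests.post(SCRAPYD_URL + "/schedule.json",
                         data={"project": PROJECT, "spider": spider_name})
    resp.raise_for_status()
    return resp.json().get("jobid")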

Anyway, talking about your code, it should work if you do this:

process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for a demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on which spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
process.start()

You need to place process.start() outside of the loop.

Umair Ayub
  • I do not want to use scrapyD, as said here, since most of the spiders are still in the development phase. – Gaur93 Jan 31 '18 at 08:48
  • By doing `process.start()` outside the for loop, it will start 20 spiders at the same time. – Gaur93 Jan 31 '18 at 08:49
  • @Gaur93 That does not matter; at least you have clean access to a spider's logs and items while you use scrapyD, and you can install scrapyD on localhost. – Umair Ayub Jan 31 '18 at 08:49

I was able to implement similar functionality by removing the loop from the script and setting up a scheduler that runs every 3 minutes.

The looping functionality was achieved by maintaining a record of how many spiders are currently running and checking whether more spiders need to be run. Thus, at any point, only 5 (this can be changed) spiders run concurrently.
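
A rough sketch of the script that the scheduler runs every 3 minutes (the table/column names, the spiders.db file and the name-to-class mapping are assumptions for illustration, not my exact code):

# Sketch only: the database layout and the SPIDER_CLASSES mapping are assumptions.
import sqlite3
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders import SPIDER_CLASSES  # hypothetical {name: spider class} dict

MAX_RUNNING = 5  # at most 5 spiders run concurrently (can be changed)

def main():
    conn = sqlite3.connect("spiders.db")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM spider_status WHERE status = 'running'")
    running = cur.fetchone()[0]

    slots = MAX_RUNNING - running
    if slots <= 0:
        conn.close()
        return  # enough spiders already running; try again on the next tick

    cur.execute("SELECT name FROM spider_status WHERE status = 'waiting' LIMIT ?", (slots,))
    waiting = [row[0] for row in cur.fetchall()]
    conn.close()
    if not waiting:
        return

    process = CrawlerProcess(get_project_settings())
    for name in waiting:
        process.crawl(SPIDER_CLASSES[name])
    process.start()  # blocks until this batch finishes; the scheduler starts
                     # the script again later to pick up more waiting spiders

if __name__ == "__main__":
    main()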

Gaur93
  • I'm also looking for a solution exactly like this. What library did you use for scheduling, and how do you query how many spiders are currently running? – NFB Feb 03 '18 at 16:36
  • @NFB I didn't use any library for scheduling; I wrote a scheduler myself. To track how many spiders are currently running, I stored the status of each spider in a database. When a spider is started, its start_requests method is called first (if you have one), so change the status of that spider to running there. When a spider is closed/finished, its closed method is called, so in that you can change the status to finished or not running. Before running spiders, you can then check how many spiders are currently running and how many you want to run; a rough sketch of these hooks follows these comments. – Gaur93 Feb 08 '18 at 04:51
  • Thanks for the info. How did you keep the code from blocking during crawling? – NFB Feb 08 '18 at 18:50
  • @NFB I limited concurrent_requests to 2 and increased download_delay. – Gaur93 Feb 09 '18 at 04:18
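
A minimal sketch of the status hooks described in the comments above (the update_status() helper, the spiders.db file and the table layout are assumptions):

# Sketch only: the status-update helper and table layout are illustrative.
import sqlite3
import scrapy

def update_status(spider_name, status):
    conn = sqlite3.connect("spiders.db")
    conn.execute("UPDATE spider_status SET status = ? WHERE name = ?",
                 (status, spider_name))
    conn.commit()
    conn.close()

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def start_requests(self):
        update_status(self.name, "running")   # mark as running when the crawl starts
        yield from super().start_requests()

    def closed(self, reason):
        update_status(self.name, "finished")  # mark as finished when the spider closes

    def parse(self, response):
        pass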