
Here's the code:

from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl spider_a -L INFO".split())
    cmdline.execute("scrapy crawl spider_b -L INFO".split())

I intend to run multiple spiders from the same main entry point of a Scrapy project, but only the first spider runs successfully; the second one seems to be ignored. Any suggestions?

KAs
  • Do you need them to run concurrently? – C. Feenstra Nov 22 '17 at 05:16
  • just sequentially is good @C.Feenstra – KAs Nov 22 '17 at 05:17
  • 1
    Common ways to approach the problem are [`CrawlerProcess`](https://stackoverflow.com/questions/39365131/running-multiple-spiders-in-scrapy-for-1-website-in-parallel) or [`Scrapyd`](https://scrapyd.readthedocs.io/en/stable/). – alecxe Nov 22 '17 at 05:20

2 Answers


Just do

import subprocess

# Loop over the spider names in a shell for-loop and crawl them one after another
subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
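If you'd rather avoid `shell=True`, a minimal sketch of the same sequential idea in plain Python (the spider names `spider_a` and `spider_b` are taken from the question):

import subprocess

# Run each spider one after another; subprocess.call blocks until the crawl exits
for spider in ['spider_a', 'spider_b']:
    subprocess.call(['scrapy', 'crawl', spider, '-L', 'INFO'])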
C. Feenstra

From the scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

import scrapy
from scrapy.crawler import CrawlerProcess

from .spiders import Spider1, Spider2

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start() # the script will block here until all crawling jobs are finished
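If the spiders live inside a Scrapy project and you also want the project settings (pipelines, middlewares, etc.) applied, you can pass them in; a minimal sketch along the same lines:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .spiders import Spider1, Spider2

# get_project_settings() picks up the settings.py of the surrounding Scrapy project
process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both crawls are finished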

EDIT: If you wish to run multiple spiders two by two, you can do the following:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

spiders = [Spider1, Spider2, Spider3, Spider4]

@defer.inlineCallbacks
def join_spiders(spiders):
    """Set up a new runner with the provided spiders and wait for all of them to finish"""

    runner = CrawlerRunner()

    # Add each spider to the current runner
    for spider in spiders:
        runner.crawl(spider)

    # This will yield when all the spiders inside the runner have finished
    yield runner.join()

@defer.inlineCallbacks
def crawl(group_by=2):
    # Run a new runner containing `group_by` spiders at a time
    for i in range(0, len(spiders), group_by):
        yield join_spiders(spiders[i:i + group_by])

    # When we have finished running all the spiders, stop the twisted reactor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
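For context on the grouping: `CrawlerRunner.join()` returns a Deferred that fires once every crawl added to that runner has finished, so each group of `group_by` spiders completes before the next group starts.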

I haven't tested all of this, though; let me know if it works!

Clément Denoix
  • Thanks buddy, these two spiders will then run concurrently; any idea how to cap the number of concurrent spiders? Say we have 10 spiders and I intend to start all of them via `CrawlerProcess`: how could I limit it so that only 2 spiders run simultaneously? Thanks! – KAs Nov 22 '17 at 11:36
  • Edited my answer. I haven't tested all my code though... Let me know if it works. – Clément Denoix Nov 22 '17 at 13:04