
Here's the code:

from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl spider_a -L INFO".split())
    cmdline.execute("scrapy crawl spider_b -L INFO".split())

I intend to run multiple spiders from the same main entry point of a Scrapy project, but only the first spider runs successfully; the second one seems to be ignored. Any suggestions?

KAs
  • Do you need them to run concurrently? – C. Feenstra Nov 22 '17 at 05:16
  • just sequentially is good @C.Feenstra – KAs Nov 22 '17 at 05:17
  • 1
    Common ways to approach the problem are [`CrawlerProcess`](https://stackoverflow.com/questions/39365131/running-multiple-spiders-in-scrapy-for-1-website-in-parallel) or [`Scrapyd`](https://scrapyd.readthedocs.io/en/stable/). – alecxe Nov 22 '17 at 05:20

2 Answers


Just do

import subprocess

# Loop over the spider names in a shell for-loop and crawl them one after another
subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
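If you'd rather avoid `shell=True`, a minimal sketch of the same sequential idea in plain Python (the spider names `spider_a` and `spider_b` are taken from the question):

import subprocess

# Run each spider one after another; subprocess.call blocks until the crawl exits
for spider in ['spider_a', 'spider_b']:
    subprocess.call(['scrapy', 'crawl', spider, '-L', 'INFO'])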
C. Feenstra

From the scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

import scrapy
from scrapy.crawler import CrawlerProcess

from .spiders import Spider1, Spider2

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start() # the script will block here until all crawling jobs are finished
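If the spiders live inside a Scrapy project and you also want the project settings (pipelines, middlewares, etc.) applied, you can pass them in; a minimal sketch along the same lines:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .spiders import Spider1, Spider2

# get_project_settings() picks up the settings.py of the surrounding Scrapy project
process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both crawls are finished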

EDIT: If you wish to run multiple spiders two by two, you can do the following:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

spiders = [Spider1, Spider2, Spider3, Spider4]

@defer.inlineCallbacks
def join_spiders(spiders):
    """Set up a new runner with the provided spiders and wait for all of them to finish"""

    runner = CrawlerRunner()

    # Add each spider to the current runner
    for spider in spiders:
        runner.crawl(spider)

    # This will yield when all the spiders inside the runner have finished
    yield runner.join()

@defer.inlineCallbacks
def crawl(group_by=2):
    # Run a new runner containing `group_by` spiders at a time
    for i in range(0, len(spiders), group_by):
        yield join_spiders(spiders[i:i + group_by])

    # When we have finished running all the spiders, stop the twisted reactor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
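For context on the grouping: `CrawlerRunner.join()` returns a Deferred that fires once every crawl added to that runner has finished, so each group of `group_by` spiders completes before the next group starts.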

I haven't tested all of this, though; let me know if it works!

Clément Denoix
  • Thanks buddy, these two spiders will then run concurrently; any idea how to cap the number of concurrent spiders? Say we have 10 spiders and I intend to start all of them via `CrawlerProcess`: how could I limit it so that only 2 spiders run simultaneously? Thanks! – KAs Nov 22 '17 at 11:36
  • Edited my answer. I haven't tested all my code though... Let me know if it works. – Clément Denoix Nov 22 '17 at 13:04