19

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code changed quite a bit.

I tried creating my own command:

from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        spiders = spman_cls.from_settings(settings)

        for spider_name in spiders.list():
            spider = self.crawler.spiders.create(spider_name)
            self.crawler.crawl(spider)

        self.crawler.start()

But once a spider is registered with self.crawler.crawl(), I get assertion errors for all of the other spiders:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/home/blender/Projects/scrapers/store_crawler/store_crawler/commands/crawlall.py", line 22, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/site-packages/scrapy/crawler.py", line 47, in crawl
    return self.engine.open_spider(spider, requests)
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/engine.py", line 215, in open_spider
    spider.name
exceptions.AssertionError: No free spider slots when opening 'spidername'

Is there any way to do this? I'd rather not start subclassing core Scrapy components just to run all of my spiders like this.

Blender

6 Answers

33

Why not just use something like this?

scrapy list|xargs -n 1 scrapy crawl
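
(This launches a separate `scrapy crawl` process per spider, one after another, so the spiders run sequentially rather than sharing a single process.)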

side2k
25

Here is an example that does not run inside a custom command, but runs the Reactor manually and creates a new Crawler for each spider:

from twisted.internet import reactor
from scrapy.crawler import Crawler
# the scrapy.conf.settings singleton was deprecated last year
from scrapy.utils.project import get_project_settings
from scrapy import log

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()

log.start()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.configure()

for spider_name in crawler.spiders.list():
    setup_crawler(spider_name)

reactor.run()

You will have to design some signal system to stop the reactor when all spiders are finished.
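
One way to do that, sketched here against the same old Crawler API and the setup_crawler() helper from the block above (reusing its reactor, settings and Crawler names), is to connect each crawler's spider_closed signal to a callback that stops the reactor once the last spider has closed:

from scrapy import signals

finished = {'done': 0, 'total': 0}

def stop_reactor_when_done(spider):
    # called once per spider via the spider_closed signal
    finished['done'] += 1
    if finished['done'] == finished['total']:
        reactor.stop()

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.signals.connect(stop_reactor_when_done, signal=signals.spider_closed)
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    finished['total'] += 1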

EDIT: And here is how you can run multiple spiders in a custom command:

from scrapy.command import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import Crawler

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()

        for spider_name in self.crawler.spiders.list():
            crawler = Crawler(settings)
            crawler.configure()
            spider = crawler.spiders.create(spider_name)
            crawler.crawl(spider)
            crawler.start()

        self.crawler.start()
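
For Scrapy to discover a custom command like this, the commands package has to be declared in the project's settings.py via COMMANDS_MODULE. Assuming the file is saved as crawlall.py inside the project's commands package, as in the question, that would be:

# settings.py
COMMANDS_MODULE = 'store_crawler.commands'

after which the command can be invoked as scrapy crawlall.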
Steven Almeroth
  • Thank you, this is exactly what I was trying to do. – Blender Mar 22 '13 at 22:42
  • How do I start the program? – user1787687 Aug 10 '13 at 20:12
  • Put the code in a text editor and save as `mycoolcrawler.py`. In Linux you probably can run `python mycoolcrawler.py` from the command line in the directory you saved it in. In Windows maybe you can just double-click it from file-manager. – Steven Almeroth Aug 11 '13 at 15:39
  • Could you please explain the difference between `Crawler` and `Spider`? AFAIK, `Spider` controls things like how responses are processed (such as what items to scrape and how links are extracted...), so what does `Crawler` do? – Alcott Sep 22 '14 at 08:25
  • @Alcott That's right, the Spider deals with the responses and the Crawler deals with Spiders: instantiates them, configures settings & middlewares, etc. – Steven Almeroth Sep 22 '14 at 18:49
  • Is there any reason to use a Crawler per spider as here, versus using a single crawler for many spiders? – yangmillstheory Jan 02 '15 at 17:04
  • Is it possible to run multiple instances of ONE spider with different params using `Scrapy`? – Shuai Zhang Jan 29 '15 at 15:00
  • @ShuaiZhang Surely you can do anything you want in separate processes. – Steven Almeroth Jan 31 '15 at 02:52
7

The answer from @Steven Almeroth will fail in Scrapy 1.0; you should edit the script like this:

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

class Command(ScrapyCommand):

    requires_project = True
    excludes = ['spider1']

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()
        crawler_process = CrawlerProcess(settings) 

        for spider_name in crawler_process.spider_loader.list():
            if spider_name in self.excludes:
                continue
            spider_cls = crawler_process.spider_loader.load(spider_name) 
            crawler_process.crawl(spider_cls)
        crawler_process.start()
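
The excludes class attribute is just a convenience for skipping particular spiders by name; leave the list empty to run every spider in the project.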
Soarone
5

This code works on Scrapy 1.3.3 (save it in the same directory as scrapy.cfg):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print ("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used by your spiders

process.start()

For Scrapy 1.5.x (so you don't get the deprecation warning):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    print ("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used by your spiders

process.start()
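
In both versions, the query="dvh" keyword is passed through to each spider's constructor as an ordinary spider argument, so a spider that wants to use it might look roughly like this (ExampleSpider and the URL are hypothetical placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, query=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.query = query  # "dvh" when launched by the loop above

    def start_requests(self):
        # build the first request from the passed-in argument
        yield scrapy.Request("https://example.com/search?q=%s" % self.query)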
Yuda Prawira
  • I also used the answer given by Yuda Prawira above, which still works in Scrapy 1.5.2, but I got this warning: ```ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader.``` All you have to do is change the name in the for loop: ```for spider in process.spider_loader.list(): ...``` Otherwise it still works! – float13 Apr 22 '19 at 02:12
0

Linux script

#!/bin/bash
for spider in $(scrapy list)
do
    scrapy crawl "$spider" -o "$spider".json
done
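
Each spider writes its items to its own feed file (spidername.json), and the spiders are crawled one at a time.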
xaander1
0

Running all spiders in the project using Python:

# Run all spiders in project implemented using Scrapy 2.7.0

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    spiders_names = process.spider_loader.list()
    for s in spiders_names:
        process.crawl(s)
    process.start()


if __name__ == '__main__':
    main()
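
Run the script from inside the project, so that get_project_settings() can locate scrapy.cfg and the project settings, e.g. python run_all_spiders.py if that is what the file is named.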
SUNIL KUMAR