
I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!

– user1009453

2 Answers


My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

YOURPROJECTNAME/commands/allcrawl.py :

from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # Iterate over every spider registered in the project
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            # POST a schedule request to scrapyd for this spider
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response.read())

Make sure the commands directory contains an empty `__init__.py`, and include the following in your settings.py:

COMMANDS_MODULE = 'YOURPROJECTNAME.commands'

Then from the command line (in your project directory) you can simply type

scrapy allcrawl
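The command above uses Python 2's urllib2. A minimal Python 3 sketch of the same idea, assuming scrapyd is listening on localhost:6800 and using the placeholder project name "myproject" (the helper names `build_payloads` and `schedule_all` are illustrative, not scrapy APIs):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # assumed scrapyd address

def build_payloads(project, spiders):
    """One form-encoded schedule.json body per spider name."""
    return [urlencode({"project": project, "spider": s}).encode() for s in spiders]

def schedule_all(project, spiders):
    """POST each payload to scrapyd and return the raw JSON responses."""
    return [urlopen(SCRAPYD_URL, data=body).read() for body in build_payloads(project, spiders)]
```

The spider names can come from `scrapy list` on the command line, or from the project's spider registry as in the custom command above.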
– dru
  • Great, I will try this first thing in the morning. I have no computer at the moment. Thanks for helping out! – user1009453 May 29 '12 at 19:20
  • Hi. I tried your solution but I get the following import error: "No module named commands". I put the line "COMMANDS_MODULE = 'YOURPROJECTNAME.commands'" in the settings file in the project directory. Is this correct? – user1009453 May 30 '12 at 08:26
  • @user1009453 Make sure your commands folder has an `__init__.py` – dru May 30 '12 at 17:40
  • Thanks for staying with me on this, it really helps. OK, I've included an `__init__.py` file in the commands directory and uploaded the project to scrapyd again. But now I get the following error: NotImplementedError. Do I have to implement/instantiate the command? – user1009453 May 31 '12 at 08:15
  • What is the structure inside schedule.json? – user1787687 Aug 10 '13 at 18:21
  • Can anyone explain how multiple spiders are crawled by this custom command? – Nabin Feb 12 '14 at 08:07
  • How do I pass arguments, e.g. -a query="myquery", with this? – Yuda Prawira May 11 '17 at 20:38

Sorry, I know this is an old topic, but I started learning scrapy recently and stumbled on it; I don't have enough rep yet to post a comment, so I'm posting an answer.

From the common scrapy practices you'll see that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your spider runs among them.
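That distribution step can be sketched in plain Python (the endpoint URLs and helper names here are illustrative, not scrapy APIs): round-robin the spider names across the scrapyd instances, then POST each spider to its instance's schedule.json.

```python
from itertools import cycle
from urllib.parse import urlencode

def assign_round_robin(spiders, endpoints):
    """Pair each spider with a scrapyd endpoint, cycling through endpoints."""
    return list(zip(spiders, cycle(endpoints)))

def payload(project, spider):
    """Form-encoded body for scrapyd's schedule.json."""
    return urlencode({"project": project, "spider": spider}).encode()

# Example: three spiders spread over two scrapyd instances
plan = assign_round_robin(
    ["spider1", "spider2", "spider3"],
    ["http://localhost:6800", "http://localhost:6801"],
)
# plan == [("spider1", "http://localhost:6800"),
#          ("spider2", "http://localhost:6801"),
#          ("spider3", "http://localhost:6800")]
```

Each (spider, endpoint) pair would then be POSTed to `endpoint + "/schedule.json"` with the corresponding payload.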

  • If this was true when posted, it's not anymore. scrapyd runs jobs in parallel. If it seems it's not, check this: https://stackoverflow.com/questions/9161724/scrapy-s-scrapyd-too-slow-with-scheduling-spiders – brunobg Dec 23 '20 at 15:18