
I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!

– user1009453

2 Answers


My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

YOURPROJECTNAME/commands/allcrawl.py :

from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # Iterate over every spider registered in the project
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            # POST a schedule request to scrapyd for this spider
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response.read())

Make sure the commands directory contains an empty `__init__.py`, and include the following in your settings.py:

COMMANDS_MODULE = 'YOURPROJECTNAME.commands'

Then from the command line (in your project directory) you can simply type

scrapy allcrawl
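The command above uses Python 2's urllib2. A minimal Python 3 sketch of the same idea, assuming scrapyd is listening on localhost:6800 and using the placeholder project name "myproject" (the helper names `build_payloads` and `schedule_all` are illustrative, not scrapy APIs):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # assumed scrapyd address

def build_payloads(project, spiders):
    """One form-encoded schedule.json body per spider name."""
    return [urlencode({"project": project, "spider": s}).encode() for s in spiders]

def schedule_all(project, spiders):
    """POST each payload to scrapyd and return the raw JSON responses."""
    return [urlopen(SCRAPYD_URL, data=body).read() for body in build_payloads(project, spiders)]
```

The spider names can come from `scrapy list` on the command line, or from the project's spider registry as in the custom command above.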
– dru
  • Great, I will try this first thing in the morning. I have no computer at the moment. Thanks for helping out! – user1009453 May 29 '12 at 19:20
  • Hi. I tried your solution but I get the following import error: "No module named commands". I put the line "COMMANDS_MODULE = 'YOURPROJECTNAME.commands'" in the settings file in the project directory. Is this correct? – user1009453 May 30 '12 at 08:26
  • @user1009453 Make sure your commands folder has an `__init__.py` – dru May 30 '12 at 17:40
  • Thanks for staying with me on this, it really helps. OK, I've included an `__init__.py` file in the commands directory and uploaded the project to scrapyd again. But now I get the following error: NotImplementedError. Do I have to implement/instantiate the command? – user1009453 May 31 '12 at 08:15
  • What is the structure inside schedule.json? – user1787687 Aug 10 '13 at 18:21
  • Can anyone explain how multiple spiders are crawled by this custom command? – Nabin Feb 12 '14 at 08:07
  • How do I pass arguments, e.g. -a query="myquery", with this? – Yuda Prawira May 11 '17 at 20:38

Sorry, I know this is an old topic, but I started learning scrapy recently and stumbled on it; I don't have enough rep yet to post a comment, so I'm posting an answer.

From the common scrapy practices you'll see that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your spider runs among them.
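That distribution step can be sketched in plain Python (the endpoint URLs and helper names here are illustrative, not scrapy APIs): round-robin the spider names across the scrapyd instances, then POST each spider to its instance's schedule.json.

```python
from itertools import cycle
from urllib.parse import urlencode

def assign_round_robin(spiders, endpoints):
    """Pair each spider with a scrapyd endpoint, cycling through endpoints."""
    return list(zip(spiders, cycle(endpoints)))

def payload(project, spider):
    """Form-encoded body for scrapyd's schedule.json."""
    return urlencode({"project": project, "spider": spider}).encode()

# Example: three spiders spread over two scrapyd instances
plan = assign_round_robin(
    ["spider1", "spider2", "spider3"],
    ["http://localhost:6800", "http://localhost:6801"],
)
# plan == [("spider1", "http://localhost:6800"),
#          ("spider2", "http://localhost:6801"),
#          ("spider3", "http://localhost:6800")]
```

Each (spider, endpoint) pair would then be POSTed to `endpoint + "/schedule.json"` with the corresponding payload.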

  • If this was true when posted, it's not anymore. scrapyd runs jobs in parallel. If it seems it's not, check this: https://stackoverflow.com/questions/9161724/scrapy-s-scrapyd-too-slow-with-scheduling-spiders – brunobg Dec 23 '20 at 15:18