
I have a situation where I have a CrawlSpider that searches for results using postal codes and categories (POST data). I need to get all the results for all the categories in all postal codes. My spider takes a postal code and a category as arguments for the POST data. I want to programmatically start a spider for each postal code/category combo via a script.

The documentation explains that you can run multiple spiders per process, with a code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process This is along the lines of what I want to do, except that I essentially want to queue up spiders so that each one starts only after the preceding spider finishes.
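
For reference, the pattern on that docs page (for Scrapy 0.24) looks roughly like this; MySpider, myproject.spiders and the sample postal code/category values below are placeholders for my own spider and data. It starts all the crawlers up front so they run concurrently, which is exactly what I want to avoid:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # placeholder for my own CrawlSpider


def setup_crawler(postal_code, category):
    spider = MySpider(postal_code=postal_code, category=category)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()


# every combination is started immediately; they all crawl at the same time
for postal_code, category in [('10801', 'plumbers'), ('10802', 'plumbers')]:
    setup_crawler(postal_code, category)

log.start()
reactor.run()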

Any ideas on how to accomplish this? There seem to be some answers that apply to older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).

aaearon

1 Answer


You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is sample code (not tested), based on this answer and adapted to your use case:

from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']


def configure_crawler(postal_code):
    # MySpider is your own spider class that takes the postal code as an argument
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach the spider left over from the previous run
    # (avoids the "Spider already attached" assertion error)
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)


# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()


settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
alecxe
  • Great example. After the first spider is done and the callback is complete, when executing `crawler.crawl(spider)` I get the following AssertionError: `exceptions.AssertionError: Spider already attached`. Not sure how to tackle this – aaearon Jan 19 '15 at 15:46
  • @aaearon thanks for trying it out! I've updated the code, adding a line to detach the spider from the crawler; that should help. Though this is starting to be sort of magic :) – alecxe Jan 19 '15 at 15:56
  • I found that to get the above working as intended I had to move the initialization of the settings and crawler, along with `crawler.start()`, inside the `configure_crawler` function; otherwise a second crawler would start, but it would pick up where the first crawler left off, not use the new data, and loop over the last URL. – aaearon Jan 20 '15 at 08:44
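
For reference, a minimal sketch of the adjustment described in the last comment (untested; it reuses the imports, postal_codes list and callback from the answer above, and creates a fresh settings/crawler pair per postal code):

def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # fresh settings and crawler for every run, so the new spider
    # does not inherit state from the previous crawl
    settings = Settings()
    crawler = Crawler(settings)

    # fire the callback when this spider closes
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()


# kick off the first crawl; callback() schedules the rest
configure_crawler(postal_codes.pop())
log.start()
reactor.run()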