
So the issue is that I have one spider that crawls through a website, scraping a bunch of product information. Then I would like to have a second spider that takes the list of product links built up by the first and uses it for checking purposes.

I realize I could just do this all in one spider, but that spider is already very large (it is a generic spider covering 25+ different domains), and I would like to keep the two concerns as separate as possible. Currently I am creating instances of this master spider as follows:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spiders(*urls, ajax=False):
    process = CrawlerProcess(get_project_settings())
    for url in urls:
        process.crawl(MasterSpider, start_page=url, ajax_rendered=ajax)
    process.start()

Ideally, the second spider would kick off after the first finishes, using the list of links the first one built.

I tried spawning another CrawlerProcess within the closed handler of the MasterSpider, but the reactor is already running at that point, so clearly that isn't going to work. Any ideas?
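
For illustration, that dead end looks roughly like the sketch below (not my actual code: the spider bodies are stubs, and CheckerSpider / product_links are hypothetical names). CrawlerProcess.start() tries to run the Twisted reactor, which is already running inside the outer process, so the nested call fails with a reactor error instead of launching the second crawl:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class CheckerSpider(scrapy.Spider):
    name = 'checker'

    def __init__(self, links=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = links or []

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}


class MasterSpider(scrapy.Spider):
    name = 'master'
    # ... parsing logic that fills self.product_links ...

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes, but the reactor
        # started by the outer CrawlerProcess is still running, so the nested
        # start() below raises a reactor error rather than running the crawl.
        process = CrawlerProcess(get_project_settings())
        process.crawl(CheckerSpider, links=getattr(self, 'product_links', []))
        process.start()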

Note that whenever I try to switch to a CrawlerRunner, it doesn't quite work even when I follow the documentation and existing questions on here exactly. I'm thinking from_crawler might be the way to go, but I'm not entirely sure.

1 Answer


To chain multiple crawlers in a script you need to dive into Twisted deferreds. The official docs (Common Practices: running multiple spiders in the same process) show this as a potential solution.

According to that code snippet we can create something like this:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals

DATA = []  # shared list the two spiders communicate through

def store_item(*args, **kwargs):
    # item_scraped handlers receive the scraped item as a keyword argument
    DATA.append(kwargs['item'])


class ProducerSpider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        yield {'url': response.url}


class ConsumerSpider(scrapy.Spider):
    name = 'spider2'

    def start_requests(self):
        for item in DATA:
            yield scrapy.Request(item['url'])

    def parse(self, response):
        yield {'url': response.url}


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    crawler1 = runner.create_crawler(ProducerSpider)
    crawler2 = runner.create_crawler(ConsumerSpider)
    # store every scraped item in the shared DATA list
    crawler1.signals.connect(store_item, signals.item_scraped)
    crawler2.signals.connect(store_item, signals.item_scraped)
    # run the crawls sequentially: the second starts only after the first finishes
    yield runner.crawl(crawler1)
    yield runner.crawl(crawler2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished
print('TOTAL RESULTS:')
print(DATA)

It's a bit long and hacky, but it's not as complicated as it looks:

  1. Create two spider classes and a globally accessible variable, DATA, for them to communicate through.
  2. Create a CrawlerRunner instance to manage the crawling.
  3. Create a crawler for each spider and chain them to our runner.
  4. Connect a signal handler to both crawlers so every scraped item is stored in the shared DATA variable.

So the first crawler runs and passes every scraped item through the store_item function, which simply appends it to DATA. Then the second crawler starts, and its start_requests method reads directly from DATA to generate its starting requests.
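
Applied to the question, the same pattern could look roughly like the sketch below. This is only an outline under a few assumptions: MasterSpider is your existing generic spider (the import path is a placeholder), start_page and ajax_rendered are the arguments from your run_spiders, while CheckerSpider, its links argument and the 'url' item field are hypothetical names for the checking side. Here the collected items are handed to the second crawl as a spider argument instead of having it read the global directly:

from twisted.internet import reactor, defer
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

# from myproject.spiders import MasterSpider  # your existing generic spider

PRODUCTS = []

def collect(item, **kwargs):
    # item_scraped handler: stash every product item the first pass yields
    PRODUCTS.append(item)


class CheckerSpider(scrapy.Spider):
    # hypothetical checking spider that revisits the collected links
    name = 'checker'

    def __init__(self, links=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = links or []

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}


@defer.inlineCallbacks
def crawl(*urls, ajax=False):
    runner = CrawlerRunner(get_project_settings())
    # first pass: one MasterSpider per start URL, all collecting into PRODUCTS
    first_pass = []
    for url in urls:
        crawler = runner.create_crawler(MasterSpider)
        crawler.signals.connect(collect, signals.item_scraped)
        first_pass.append(runner.crawl(crawler, start_page=url, ajax_rendered=ajax))
    yield defer.DeferredList(first_pass)
    # second pass: feed the collected links to the checking spider
    links = [item['url'] for item in PRODUCTS]  # assumes each item carries a 'url' field
    yield runner.crawl(CheckerSpider, links=links)
    reactor.stop()


configure_logging()
crawl('http://example.com', ajax=False)
reactor.run()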
