
I'm trying to scrape some urls with Scrapy and Selenium. Some of the urls are processed by Scrapy directly and the others are handled with Selenium first.

The problem is: while Selenium is handling a url, Scrapy is not processing the others in parallel. It waits for the webdriver to finish its work.

I have tried to run multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems that I don't have enough experience to make it right.

In the example below, all the urls are printed only after the webdriver is closed. Please advise: is there any way to make it run "in parallel"?

import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)
  • Have you tried multiple threads? Processes can't share memory so some things will break. – kagronick Apr 13 '20 at 19:54
  • @kagronick, I have tried [running multiple spiders in the same process](https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process) (I passed different init parameters), but it led to exactly the same problem: Scrapy was waiting for the webdriver to finish its work. – Konstantin Ulianov Apr 13 '20 at 22:08

1 Answer


It seems that I've found a solution.

I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.

I had tried this approach before, but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of CrawlerProcess more than once in the same process, which is not allowed: the Twisted reactor cannot be restarted once it has stopped. Here I found a simple and clear example of running a spider in a loop using callbacks.
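
For illustration, the broken pattern was roughly the following (a minimal sketch of my earlier attempt, not part of the final solution). The first call to process.start() blocks until the crawl finishes and stops the reactor, so the second call raises the exception:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

# Anti-pattern: starting the reactor once per task. A Twisted reactor cannot
# be restarted after it has stopped, so the second iteration fails with
# twisted.internet.error.ReactorNotRestartable.
for task in tasks:
    process = CrawlerProcess(get_project_settings())
    process.crawl('test_spider', task=task)
    process.start()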

So I split my tasks list between the processes. Then, inside the crawl(tasks) function, I build a chain of callbacks that runs my spider multiple times, passing a different task as its init parameter each time.

import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        # Chain the crawls: when a crawl finishes, its Deferred fires and this
        # callback schedules the spider again with the next task.
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        # np.array_split divides the tasks roughly evenly; each chunk is
        # crawled by its own worker process.
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()

The TestSpider code from my question must be modified accordingly so that it accepts a task as an init parameter.

def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)
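
For completeness, the modified spider might look like this (just a sketch that merges the snippet above with the original TestSpider from my question; load_with_selenium and parse are unchanged):

import scrapy

# load_with_selenium is the helper from the question (defined or imported in this module).


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    def __init__(self, task):
        scrapy.Spider.__init__(self)
        self.task = task  # a single {'start_url': ..., 'selenium': ...} dict

    def start_requests(self):
        yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)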