I'm trying to scrape some URLs with Scrapy and Selenium. Some of the URLs are processed by Scrapy directly, and the others are handled by Selenium first.
The problem is that while Selenium is handling a URL, Scrapy does not process the others in parallel; it waits for the webdriver to finish its work.
I have tried running multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems I don't have enough experience to get that right.
In the example below, all the URLs are printed only after the webdriver is closed. Please advise: is there any way to make this run in parallel?
import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            # This call blocks the reactor, so no other requests are processed
            # until the webdriver finishes and closes.
            response = response.replace(body=load_with_selenium(response.meta['start_url']))
        for url in response.xpath('//a/@href').getall():
            print(url)
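One direction I've been considering but haven't gotten working: offload the blocking Selenium call to a thread with Twisted's deferToThread and await the result from a coroutine callback, so the reactor keeps running in the meantime. A minimal, untested sketch (it assumes Scrapy's async def callbacks and scrapy.utils.defer.maybe_deferred_to_future, available in Scrapy 2.6+):

from twisted.internet.threads import deferToThread

from scrapy.utils.defer import maybe_deferred_to_future


class TestSpider(scrapy.Spider):
    # name, tasks and start_requests as in the example above

    async def parse(self, response):
        if response.meta['selenium']:
            # run the blocking Selenium call in a reactor thread-pool thread
            body = await maybe_deferred_to_future(
                deferToThread(load_with_selenium, response.meta['start_url']))
            response = response.replace(body=body)
        for url in response.xpath('//a/@href').getall():
            print(url)

Would something like this let Scrapy keep fetching the other URLs while Selenium works?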