
I am making a Selenium-based spider for dynamic websites, but I want to stay within the Scrapy framework, as this spider is part of a bigger project that runs all of its spiders through the same workflow/commands.

The easiest thing to do is to pass the requests from start_requests() to parse() as usual and do all the Selenium work inside parse().

However, this way I end up making two requests to the website: once by Scrapy, and once by Selenium.
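To make the problem concrete, the double-request version looks roughly like this (the spider name, start URL, Firefox driver, and h1 selector are all placeholders):

import scrapy
from selenium import webdriver


class DoubleFetchSpider(scrapy.Spider):
    name = 'double_fetch'
    start_urls = ['https://example.com']  # placeholder

    def parse(self, response):
        # Scrapy has already downloaded response.url by the time parse()
        # runs; driver.get() below fetches the same page a second time.
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            selector = scrapy.Selector(text=driver.page_source)
            for title in selector.css('h1::text').getall():
                yield {'title': title}
        finally:
            driver.quit()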

What I want instead is to pass only the URL to parse(), download it with Selenium there, and continue parsing:

def start_requests(self):
    for url in self.start_urls:
        # hand the bare URL to parse() and delegate to its generator
        yield from self.parse(url)
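Spelled out fully, the idea would look something like the sketch below (again with a placeholder driver, URL, and selector); it is this complete version that Scrapy rejects, as described next:

import scrapy
from selenium import webdriver


class SeleniumOnlySpider(scrapy.Spider):
    name = 'selenium_only'
    start_urls = ['https://example.com']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield from self.parse(url)  # yields items, not Requests

    def parse(self, url):
        # Only Selenium touches the network here; Scrapy never downloads.
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            selector = scrapy.Selector(text=driver.page_source)
            for title in selector.css('h1::text').getall():
                yield {'title': title}
        finally:
            driver.quit()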

It's the first thing that comes to mind, but it seems Scrapy requires that start_requests() ultimately yield Request objects, and when I try the above I get errors (I can post them on request).

So I came up with another idea: keep the original start_requests(), since a Request object is not supposed to download the page by itself, and disable the downloader machinery that actually performs the fetch. However, even with all of these settings emptied:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {},
    'SPIDER_MIDDLEWARES': {},
    'DOWNLOAD_HANDLERS': {},
}

when I check the outgoing traffic with ngrep, I can see that Scrapy still downloads the remote URL in addition to Selenium, despite the custom settings that I expected to cut off the Downloader.
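For reference, my reading of the Scrapy settings docs is that these dict settings are merged with their *_BASE defaults, so an empty dict disables nothing; a component is only switched off when it is explicitly mapped to None. A minimal sketch of that for the built-in HTTP(S) download handlers:

custom_settings = {
    'DOWNLOAD_HANDLERS': {
        # An empty dict is merged with DOWNLOAD_HANDLERS_BASE and changes
        # nothing; each scheme has to be mapped to None to be disabled.
        'http': None,
        'https': None,
    },
}

Disabling the handlers outright is probably not the real answer either, since Scrapy would then have no way to process the scheduled Requests at all, but it may explain why the empty dicts above had no effect.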

How do I download the URLs only once, via Selenium, in this case?

Nikolay Shindarov
    Possible duplicate of [selenium with scrapy for dynamic page](https://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page) – nyov Aug 31 '19 at 10:20

0 Answers