I am making a Selenium-based spider for dynamic websites, but I want to stay within the Scrapy framework, since this spider is part of a bigger project that runs all its spiders through the same workflow/commands.
The easiest thing to do is to pass the requests from start_requests() to parse() and do all the Selenium work within parse(). However, this way I will be making two requests to the website: once by Scrapy and once by Selenium.
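To be concrete, the straightforward version I mean is roughly the sketch below (the spider name, class name and URL are just placeholders, and Chrome is only an example driver):

import scrapy
from selenium import webdriver

class DynamicSpider(scrapy.Spider):
    name = 'dynamic'
    start_urls = ['https://example.com']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's downloader fetches the page for this request (first hit)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Selenium then fetches the same page again (second hit)
        driver = webdriver.Chrome()
        driver.get(response.url)
        # ... extract data from driver.page_source here ...
        driver.quit()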
What I want instead is to pass only the URL to Selenium in parse(), download it there, and parse further:
def start_requests(self):
    for url in self.start_urls:
        yield from self.parse(url)
That is the first thing that comes to mind, but it seems Scrapy has the limitation that start_requests() must eventually yield Request objects. And when I do it this way, I get errors (I can post them on request).
So I came up with another idea: keep the original start_requests(), since a Request object is not supposed to download the page by itself, and disable the downloader middleware that actually does the downloading. However, even when blanking out all the middlewares and download handlers:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {},
    'SPIDER_MIDDLEWARES': {},
    'DOWNLOAD_HANDLERS': {},
}
when I check the outgoing requests with ngrep, I can see that Scrapy is still downloading the remote URL in addition to Selenium, despite the custom settings that should have cut off the Downloader.
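For reference, my understanding is that Scrapy merges these dicts with its built-in *_BASE defaults, so empty dicts would not actually remove anything; an explicit opt-out would presumably have to map the built-ins to None, roughly like this sketch:

custom_settings = {
    # Empty dicts are merged with DOWNLOAD_HANDLERS_BASE and change nothing;
    # disabling a built-in handler means mapping its scheme to None.
    'DOWNLOAD_HANDLERS': {
        'http': None,
        'https': None,
    },
}

But I am not sure that is even the right lever, which brings me back to the core question.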
How do I download the URLs only once, via Selenium, in this case?