
I am having an issue communicating between Selenium and Scrapy objects.

I am using Selenium to log in to a site; once I get that response, I want to use Scrapy's functionality to parse and process it. Can someone please help me write a downloader middleware so that every request goes through the Selenium webdriver and the response is passed back to Scrapy?

Thank you!

world

1 Answer


It's pretty straightforward: create a middleware that holds a webdriver and use `process_request` to intercept the request, discard it, and pass its url to your Selenium webdriver:

from scrapy.http import HtmlResponse
from selenium import webdriver


class DownloaderMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()  # your chosen driver

    def process_request(self, request, spider):
        # only process tagged requests, or delete this check to process all of them
        if not request.meta.get('selenium'):
            return
        # fetch the page with the webdriver instead of scrapy's downloader
        self.driver.get(request.url)
        body = self.driver.page_source
        # page_source is a str, so an explicit encoding is required;
        # returning a Response here makes scrapy skip its own download
        return HtmlResponse(url=self.driver.current_url, body=body,
                            encoding='utf-8', request=request)
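
In the spider, you tag the requests you want routed through the webdriver via `request.meta`. A minimal sketch, assuming a hypothetical spider name and login url:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        # the 'selenium' meta key tells the middleware above to fetch
        # this url with the webdriver instead of scrapy's downloader
        yield scrapy.Request('https://example.com/login',
                             meta={'selenium': True},
                             callback=self.parse)

    def parse(self, response):
        # response.body is the page source as rendered by selenium
        self.logger.info('logged-in page: %s', response.url)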

The downside of this is that you have to get rid of the concurrency in your spider, since the Selenium webdriver can only handle one url at a time. For that, see the settings documentation page.
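
For example, a sketch of the relevant settings.py entries, assuming the middleware lives in myproject/middlewares.py (both names are placeholders for your own project layout):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # the number sets this middleware's position in scrapy's middleware chain
    'myproject.middlewares.DownloaderMiddleware': 543,
}
# selenium drives a single browser, so have scrapy issue one request at a time
CONCURRENT_REQUESTS = 1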

Granitosaurus
  • Hi Granitosaurus, thanks for the response. I would like to know what changes I need to make to settings.py, what name I should give this middleware, and where I should save it in my project. Thank you. – world Oct 27 '16 at 18:22
  • @world You can see how to activate a custom middleware [here](https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#activating-a-downloader-middleware) – Granitosaurus Oct 27 '16 at 19:45
  • It's disingenuous to say it's straightforward, as you're breaking a lot more than just concurrency this way: you're bypassing the entire Downloader. Throttling, cookies, headers, proxies, and more are all not going to be set properly, and Selenium will fetch with whatever its defaults are. Furthermore, the response object won't have its properties set properly either, like `status` and `headers`. – Rejected Oct 28 '16 at 20:57
  • @Rejected You can pull status and headers from the webdriver as well. Unfortunately there is no straightforward way to have concurrency with Selenium. Selenium in general is not a good fit for scrapy any way you look at it, but you can make it work pretty easily if you are willing to sacrifice some aspects like concurrency. Instead of Selenium you should use a rendering service like [Splash](https://github.com/scrapinghub/splash), which was designed to work with scrapy. – Granitosaurus Oct 29 '16 at 12:03
  • If you're passing a Selenium response (i.e. `browser.page_source`) to the `DownloaderMiddleware`, why do you have to re-instantiate the webdriver in `__init__`? – oldboy Jun 28 '18 at 03:03