Improve scrapy and selenium with firefox in headless mode

Question

I am scraping a javascript heavy site and I have setup a vagrant instance to check the feasibility ( 1GB RAM ). The system crashes after parsing few urls. I am unable to determine memory requirements for this setup and reason for the crash. However I had htop running in parallel and got a screenshot before system crash, attached below. I suspect the memory is not sufficient , but I dont know how much I need. Therefore, I am looking for :

Memory requirements for my setup ( Scrapy + selenium + fireofox -headless
Reason of crash
How to improve the scraping process
Alternatives to either of ( scrapy , selenium , firefox)

SeleniumMiddleWare:

import os, traceback
from shutilwhich import which
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.project import get_project_settings
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options

SELENIUM_HEADLESS = False

settings = get_project_settings()

class SeleniumMiddleware(object):
    driver = None

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        if not request.meta.get('selenium'):
            return
        self.driver.get(request.url)

        #if setting new cookies remove old
        if len(request.cookies):
            self.driver.implicitly_wait(1)
            self.driver.delete_all_cookies()

        #add only desired cookies if session persistence is requested
        request_cookies=[]

        if request.meta.get('request_cookies'):
           request_cookies=request.meta.get('request_cookies')

        for cookie in request.cookies:
             if cookie['name'] in request_cookies:
                print ' ---- set request cookie [%s] ---- ' % cookie['name']
                new_cookie={k: cookie[k] for k in ('name', 'value','path', 'expiry') if k in cookie}
                self.driver.add_cookie(new_cookie)

        if  request.meta.get('redirect_url'):
            self.driver.get(request.meta.get('redirect_url'))
            self.driver.implicitly_wait(5)

        request.meta['driver'] = self.driver

        return HtmlResponse(self.driver.current_url, body=self.driver.page_source, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        options=Options()
        binary= settings.get('SELENIUM_FIREFOX_BINARY') or which('firefox')
        SELENIUM_HEADLESS=settings.get('SELENIUM_HEADLESS') or False
        if SELENIUM_HEADLESS:
            print " ---- HEADLESS ----"
            options.add_argument( "--headless" )

        firefox_capabilities = DesiredCapabilities.FIREFOX
        firefox_capabilities['marionette'] = True
        firefox_capabilities['binary'] = binary
        try:
            self.driver = webdriver.Firefox(capabilities=firefox_capabilities, firefox_options=options)
        except Exception:
          print " ---- Unable to instantiate selenium webdriver instance ! ----"
          traceback.print_exc()
          os._exit(1)

    def spider_closed(self, spider):
        if self.driver:
            self.driver.close()

Please read why a [**`screenshot of HTML or code or error is a bad idea`**](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). Consider updating the Question with formatted text based relevant HTML, code trials and error stack trace. — undetected Selenium, Jan 25 '18 at 07:15
@DebanjanB I can get you debug logs from scrapy and any other log output — sakhunzai, Jan 25 '18 at 11:26
@DebanjanB, I have increased my RAM to 2GB and its working fine, probably that was the issue — sakhunzai, Jan 25 '18 at 12:58
Great News !!! Please add the solution as an Answer for future readers. If you can share your code trials I could have got to know the **Reason of crash** — undetected Selenium, Jan 25 '18 at 13:01
Yes, @sakhunzai would be nice to know what would cause this issue. — xmaestro, Jan 25 '18 at 13:35
@xmaestro I have added middleware code that is responsible for setting up selenium webdrive, I cannot share whole scrapper here — sakhunzai, Jan 25 '18 at 13:42

score 0 · Answer 1 · edited Jan 25 '18 at 21:16

A quick sneak in your code block reveals in def spider_closed(self, spider): you are using self.driver.close() as follows :

def spider_closed(self, spider):
    if self.driver:
        self.driver.close()

As per best practices to close the webdriver variant and the webbrowser instance you should invoke the quit() method within the tearDown() {}. Invoking quit(), deletes the current browsing session by sending the quit command with {"flags":["eForceQuit"]} and finally sends the GET request on /shutdown EndPoint.

You can find a detailed discussion in How to stop geckodriver process impacting PC memory, without calling driver.quit()?

So the solution would be to replace self.driver.close() with self.driver.quit() as follows :

def spider_closed(self, spider):
    if self.driver:
        self.driver.quit()

Improve scrapy and selenium with firefox in headless mode

SeleniumMiddleWare:

1 Answers1