
I am running a scraper created using Python 3.7, Selenium 3.141.0, Google Chrome 77.0.3865.90 and ChromeDriver 77.0.3865.40 on Ubuntu 18.04.

Problem: As the script scrapes more pages, its memory usage keeps growing until both RAM and swap are full. It reaches 1-2 GB after visiting 1000-2000 pages.

What is causing the memory usage to blow up? How can we fix this?

Also, should driver.quit() be called after downloading each page, or only right before the script exits?

Simplified Scraper Code

from selenium import webdriver
from utils import load_urls, process_html    # my helper functions


def get_driver(executable_path):
    # Defined before its first use so the call below does not raise NameError.
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--start-maximized')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--incognito')
    driver = webdriver.Chrome(options=options, executable_path=executable_path)
    driver.set_script_timeout(300)
    driver.implicitly_wait(10)
    return driver


executable_path = '/usr/local/bin/chromedriver'
driver = get_driver(executable_path)
urls = load_urls('list_of_urls_to_scrape.json')

try:
    while urls:
        url = urls.pop()
        driver.get(url)
        process_html(driver.page_source)
finally:
    # Quit once, after the loop, even if a page raises an exception.
    driver.quit()
Nyxynyx
  • You could try opening each URL to scrape in a separate tab within a single driver, then close each tab after scraping, then close the driver at the end. If you can put up with the startup/shutdown time for the driver, you can close it after every X loops before you run out of memory. – mgrollins Oct 15 '19 at 00:29
  • @mgrollins Does repeatedly calling `driver.get(url)` use a single tab for all the pages we visit, or does it open a new tab every time we call `driver.get`? – Nyxynyx Oct 15 '19 at 01:04
  • Is it crashing? If it's not crashing, then that sounds normal to me. Keep in mind that things like cache and history are growing. – pguardiario Oct 15 '19 at 05:26
  • Repeatedly calling `driver.get(url)` on a single driver will only use a single tab, unless you send a specific command to open a new tab. See the Stack Overflow question linked below for ways to open new tabs, though I'm not sure these methods are reliable. It might be more reliable to just close, delete, and recreate a single driver window. https://stackoverflow.com/questions/28431765/open-web-in-new-tab-selenium-python – mgrollins Oct 15 '19 at 17:14
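
The suggestion in the comments above (quit and recreate the driver every X pages) could be sketched roughly as follows. This is only a minimal illustration, not a confirmed fix for the memory growth: `scrape_with_restarts`, `make_driver`, `handle_html` and the default of 200 pages are hypothetical names and values chosen for the example, and the function assumes only that the driver object exposes `get`, `page_source` and `quit`, as Selenium's WebDriver does.

```python
def scrape_with_restarts(urls, make_driver, handle_html, restart_every=200):
    """Visit every URL, starting a fresh browser for each batch of pages.

    make_driver   -- hypothetical zero-argument factory returning a driver
    handle_html   -- callback that receives each page's HTML source
    restart_every -- assumed batch size; tune it to the memory available

    Returns the number of drivers that were started.
    """
    if restart_every < 1:
        raise ValueError("restart_every must be at least 1")

    pending = list(urls)
    drivers_started = 0
    while pending:
        # Split off the next batch; one browser instance handles it.
        batch, pending = pending[:restart_every], pending[restart_every:]
        driver = make_driver()
        drivers_started += 1
        try:
            for url in batch:
                driver.get(url)
                handle_html(driver.page_source)
        finally:
            # Quit after every batch, even on error, so the browser session
            # for this batch is shut down before the next one starts.
            driver.quit()
    return drivers_started
```

With the helpers from the question, a call might look like `scrape_with_restarts(load_urls('list_of_urls_to_scrape.json'), lambda: get_driver(executable_path), process_html)`. Quitting per batch rather than per page keeps the browser startup cost low while still bounding how long any single Chrome session lives.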
