I am running a scraper written with Python 3.7, Selenium 3.141.0, Google Chrome 77.0.3865.90, and ChromeDriver 77.0.3865.40 on Ubuntu 18.04.
Problem: As the script scrapes more and more pages, memory usage keeps growing until it fills both RAM and swap, reaching 1-2 GB after visiting 1000-2000 pages.
What is causing the memory usage to blow up, and how can I fix it?
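For anyone reproducing this, one way to watch the browser's memory from the same script is something like the following (psutil and the log_memory helper are not part of my scraper, just a measurement aid):

    import psutil

    def log_memory(driver):
        # Sum the RSS of chromedriver plus every Chrome child process it spawned
        root = psutil.Process(driver.service.process.pid)
        total = root.memory_info().rss
        for child in root.children(recursive=True):
            total += child.memory_info().rss
        print(f'browser processes: {total / 1024 / 1024:.0f} MiB')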
Also, should I call driver.quit() after downloading each page, or only once, right before the script exits? (The two placements I mean are sketched after the code below.)
Simplified Scraper Code
    from selenium import webdriver

    from utils import load_urls, process_html  # my helper functions


    def get_driver(executable_path):
        # Build a headless Chrome instance with the options I normally use
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--start-maximized')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--incognito')
        driver = webdriver.Chrome(options=options, executable_path=executable_path)
        driver.set_script_timeout(300)
        driver.implicitly_wait(10)
        return driver


    executable_path = '/usr/local/bin/chromedriver'
    driver = get_driver(executable_path)
    urls = load_urls('list_of_urls_to_scrape.json')

    # Reuse the same browser session for every URL
    while urls:
        url = urls.pop()
        driver.get(url)
        process_html(driver.page_source)

    # Quit only once, after all URLs have been visited
    driver.quit()
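To clarify the driver.quit() question, these are the two placements I am comparing. Option A is what the script above already does; Option B is the per-page restart I am unsure about:

    # Option A: one long-lived browser, quit once at the end (current approach)
    driver = get_driver(executable_path)
    while urls:
        driver.get(urls.pop())
        process_html(driver.page_source)
    driver.quit()

    # Option B: a fresh browser per page, quit after every download
    while urls:
        driver = get_driver(executable_path)
        try:
            driver.get(urls.pop())
            process_html(driver.page_source)
        finally:
            driver.quit()

Option B obviously pays the cost of a Chrome start-up for every page, so I would only switch to it if that is the accepted way to keep memory bounded.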