
So I've been working on a scraper that visits 10k+ pages and extracts data from each of them.

The issue is that memory consumption rises drastically over time. To overcome this, instead of closing the driver instance only at the end of the scrape, the scraper was updated to close the instance after every page is loaded and its data extracted.
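Roughly, the per-page version looks like this (a simplified sketch; url_list and parse stand in for my real crawl queue and extraction code):

import selenium.webdriver as webdriver

for url in url_list:
    driver = webdriver.Firefox()
    driver.get(url)
    parse(driver.page_source)  # extract what I need from the page
    driver.quit()              # tear the instance down after every page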

But RAM still fills up for some reason.

I tried using PhantomJS, but it doesn't load the data properly for some reason. With the initial version of the scraper I also tried limiting Firefox's cache to 100 MB, but that did not work either.
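For reference, the cache limit was set through a Firefox profile, roughly like this (a sketch from memory; the preference names and KB values are what I believe I used):

import selenium.webdriver as webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.cache.disk.capacity", 102400)    # ~100 MB, in KB
profile.set_preference("browser.cache.memory.capacity", 102400)  # ~100 MB, in KB
driver = webdriver.Firefox(firefox_profile=profile)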

Note: I ran tests with both chromedriver and Firefox, and unfortunately I can't use libraries such as requests, mechanize, etc. instead of Selenium.

Any help is appreciated since I've been trying to figure this out for a week now. Thanks.

iftheshoefritz
ScrapyNoob

5 Answers

7

The only way to force the Python interpreter to release memory back to the OS is to terminate the process. Therefore, use multiprocessing to spawn the Selenium Firefox instance in a subprocess; the memory will be freed when that subprocess terminates:

import multiprocessing as mp
import selenium.webdriver as webdriver

def worker():
    driver = webdriver.Firefox()
    # do memory-intensive work
    # closing and quitting is not what ultimately frees the memory, but it
    # is good to close the WebDriver session gracefully anyway.
    driver.close()
    driver.quit()

if __name__ == '__main__':
    p = mp.Process(target=worker)
    # run `worker` in a subprocess
    p.start()
    # make the main process wait for `worker` to end
    p.join()
    # all memory used by the subprocess will be freed to the OS
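To apply this to a 10k-page scrape, spawn one subprocess per chunk of URLs, so memory is returned to the OS between chunks. A sketch, assuming all_urls is your URL list and your extraction logic goes inside scrape_chunk (results would need to be written to disk or sent back through e.g. a multiprocessing.Queue):

import multiprocessing as mp
import selenium.webdriver as webdriver

def scrape_chunk(urls):
    driver = webdriver.Firefox()
    for url in urls:
        driver.get(url)
        # ... extract data from driver.page_source ...
    driver.quit()

if __name__ == '__main__':
    chunk_size = 100
    for i in range(0, len(all_urls), chunk_size):
        p = mp.Process(target=scrape_chunk, args=(all_urls[i:i + chunk_size],))
        p.start()
        p.join()  # block until the chunk's memory has been released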

See also Why doesn't Python release the memory when I delete a large object?

unutbu
2

Are you saying that the drivers are what's filling up your memory? How are you closing them? After you extract your data, do you still hold references to some collection that is keeping it in memory?

You mention that you were already running out of memory back when you closed the driver instance only at the end of the scrape, which suggests you are keeping extra references.
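For example, a pattern like this keeps the full HTML of every page alive for the whole run, no matter how often the driver is restarted (a hypothetical sketch; extract_fields stands in for your parsing code):

# leaks: every page_source string stays referenced until the very end
pages = []
for url in urls:
    driver.get(url)
    pages.append(driver.page_source)

# better: keep only the extracted fields so each page_source can be collected
results = []
for url in urls:
    driver.get(url)
    results.append(extract_fields(driver.page_source))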

abrarisme
  • Yes, it seems like the driver is filling memory up. I have 5 functions where Selenium is used; I use Selenium alongside Scrapy. In those functions I just instantiate a new driver instance, then near the end of the function I call driver.quit() or driver.close(). As for keeping extra references, I'm not sure that I do: I use Selenium to load the page, and once it loads I put page_source into a Scrapy selector. I don't have any memory leaks in Scrapy. – ScrapyNoob Jul 03 '16 at 09:58
  • You can check for line-by-line memory usage (in your program, not the websites) using [memory_profiler](https://pypi.python.org/pypi/memory_profiler); see the sketch after these comments. This should give you a better idea of which section is consuming your memory. If you can't find anything there, posting an example function here may be helpful. – abrarisme Jul 04 '16 at 03:34
  • @ScrapyNoob also check top to see if there are multiple instances of whatever browser you are using. – Lucas Azevedo Mar 07 '19 at 13:09
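A minimal memory_profiler sketch (scrape_one is a hypothetical stand-in for one of the scraper's own functions; running the script prints a line-by-line memory report each time the decorated function returns):

import selenium.webdriver as webdriver
from memory_profiler import profile

@profile
def scrape_one(url):
    driver = webdriver.Firefox()
    driver.get(url)
    source = driver.page_source
    driver.quit()
    return source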
1

I have experienced a similar issue, and destroying the driver myself (i.e. setting driver to None) prevented those memory leaks for me.
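In other words, something like this (a sketch; the key line is the explicit rebinding after quit):

import selenium.webdriver as webdriver

driver = webdriver.Firefox()
try:
    driver.get(url)
    # ... extract data ...
finally:
    driver.quit()
    driver = None  # drop the last reference so the object can be collected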

1

I was having the same problem until I put the driver.get(url) statements inside a try/except/finally block and made sure driver.quit() was in the finally clause; that way, it always executes. Like:

from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get(url)
    source_body = driver.page_source
except Exception as e:
    print(e)
finally:
    driver.quit()

From the docs:

The finally clause of such a statement can be used to specify cleanup code which does not handle the exception, but is executed whether an exception occurred or not in the preceding code.

Lucas Azevedo
0

Use this to force-kill any leftover chromedriver processes (Windows only):

import os

os.system("taskkill /f /im chromedriver.exe /T")