4

I use Selenium and the Firefox WebDriver with Python to scrape data from a website.

But in the code, I need to access this website more than 10k times, and doing that consumes a lot of RAM.

Usually, by the time the script has accessed the site about 2,500 times, it is already consuming 4 GB or more of RAM and it stops working.

Is it possible to reduce RAM consumption without closing the browser session?

I ask because when I start the script I need to log in to the site manually (two-factor authentication; that code is not shown below), and if I close the browser session I will need to log in again.

# driver, lista, indice and file2 are set up earlier (not shown),
# after a manual two-factor login in the already-open browser session
for itemLista in lista:
    driver.get("https://mytest.site.com/query/option?opt="+str(itemLista))

    isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
    activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')

    print(str(isActivated.text))
    print(str(activationDate.text))

    indice+=1
    print("numero: "+str(indice))

    file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")

#close file
file2.close()
fabiobh
  • Maybe instead of keeping `file2` open, only open it and write it once per iteration? It seems like the culprit is the growing size of `file2` in your buffer. – r.ook Jan 03 '19 at 19:15
  • Did you consider _Headless Firefox_ or _PhantomJS_ or _HTMLUnit_ browsers as an option? – undetected Selenium Jan 03 '19 at 19:16
  • @DebanjanB I think a headless browser is not an option for me, because when I access the site I need to enter a password: the site is protected by a two-factor code that I receive by email each time I try to access it. – fabiobh Jan 03 '19 at 19:44
  • I'm curious, can you get performance graphs from your OS? One that tracks the browser's process and your script; it'll help you nail down which is causing the memory usage hike (a psutil-based logging sketch follows below). (I'm mostly curious because I'd love to see the browser's one :D, its behavior during 10k navigations is very interesting.) – Todor Minakov Jan 04 '19 at 05:51
  • You could implement a browser recycle option - every X iterations, close the browser and the webdriver and open them again, thus getting their memory footprint back to baseline. X can be 100, 500, 2000 - whatever turns out most useful for you (this "recycle" is an expensive operation, time-wise). This should be done only if the mem leak turns out to be in the browser, though, not in your script. – Todor Minakov Jan 04 '19 at 05:55
  • use Chrome; it uses less memory. – ewwink Jan 04 '19 at 06:50
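Picking up Todor Minakov's suggestion above, here is a rough sketch of such a memory probe using psutil (the log_memory helper and the process-name matching are assumptions, not code from the thread):

    import os
    import psutil  # third-party: pip install psutil

    def log_memory(label=""):
        # RSS of this Python script
        script_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
        # summed RSS of every Firefox / geckodriver process
        browser_mb = 0.0
        for proc in psutil.process_iter():
            try:
                if proc.name().lower().startswith(("firefox", "geckodriver")):
                    browser_mb += proc.memory_info().rss / 1024 ** 2
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass  # process vanished or is off-limits; skip it
        print(label, "script: %.1f MiB, browser: %.1f MiB" % (script_mb, browser_mb))

Calling log_memory("iteration " + str(indice)) every few hundred iterations would show whether the script or the browser is the one leaking.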

3 Answers

2

I discovered how to avoid the memory leak.

I just use

time.sleep(2)

after

file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")

Now Firefox is working without consuming lots of RAM.

It is just perfect.

I don't know exactly why it stopped consuming so much memory, but I think memory use kept growing because the browser didn't have time to finish each driver.get request.
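For reference, a rough sketch of where the pause sits in the loop from the question (same setup assumed: driver, lista and file2 are created earlier):

    import time

    for itemLista in lista:
        driver.get("https://mytest.site.com/query/option?opt=" + str(itemLista))

        isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
        activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')

        file2.write(itemLista + " " + str(isActivated.text) + " " + str(activationDate.text) + "\n")
        time.sleep(2)  # pause so the browser can settle before the next driver.get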

fabiobh
1

As mentioned in my comment, only open and write to your file on each iteration instead of keeping it open in memory:

# remove the line file2 = open(...) from your code

for itemLista in lista:
    driver.get("https://mytest.site.com/query/option?opt="+str(itemLista))

    isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
    activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')

    print(str(isActivated.text))
    print(str(activationDate.text))

    indice+=1
    print("numero: "+str(indice))

    with open("your file path here", "w") as file2:
        file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")

While Selenium is quite a memory-hungry beast, it doesn't necessarily murder your RAM with each growing iteration. However, your growing open buffer for file2 does take up RAM the more you write to it. Only when it's closed will it release the virtual memory and write to the physical file.

r.ook
  • I made the change that you suggested, but unfortunately it didn't fix my problem. The Firefox browser still consumes a lot of RAM. I think that even with `file2` kept open, it doesn't affect the RAM usage much. – fabiobh Jan 03 '19 at 20:25
  • Odd. Is your driver creating a new instance of the browser each time? On Chrome I notice that it creates a new chrome.exe in the processes, but it is quickly killed off and the RAM usage stays in check. Not sure how the firefox driver behaves. If anything, I'd guess the RAM rises at each `driver.get()`... if that is so, you could consider creating a new driver and closing it each iteration, but that's probably more time-consuming. – r.ook Jan 03 '19 at 21:13
  • This solution will not decrease the memory usage - Python does not hold in memory the file content that has been written so far; it has a buffer only for what is pending writing - that is the only thing in RAM relevant to your concern, and this buffer's default size on most OSes is just a single line. In fact this approach leads to a bad/unexpected result - in every iteration the file is re-created with just the last line, i.e. the script will not store all values, but just the last one. – Todor Minakov Jan 04 '19 at 05:46
  • @Todor Minakov You are right, only the last line is saved. – fabiobh Jan 04 '19 at 11:21
  • @Idlehands When I look at the task manager, there is only a single firefox process. I can't create a new driver because I would lose the browser session, and I need the browser session because I have to enter a two-factor password manually each time I run the script. I will try the Chrome driver to see if I notice any difference. – fabiobh Jan 04 '19 at 11:25
  • Don't be overly optimistic about Chrome - it's going to have the same, if not bigger, memory footprint. Better to see what is actually using that much memory; it could be your script itself. – Todor Minakov Jan 04 '19 at 11:32
  • @TodorMinakov Good point, I shouldn't have used `w` mode (see the append-mode sketch below). I thought the buffer would take up virtual memory, but maybe I'm mistaken. Thanks for pointing it out. – r.ook Jan 04 '19 at 14:51
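Following up on the `w`-mode issue raised in the comments, a corrected sketch of this answer's approach would open the file in append mode, so every line is preserved (this fix is implied by the comments, not part of the answer as posted):

    for itemLista in lista:
        driver.get("https://mytest.site.com/query/option?opt=" + str(itemLista))

        isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
        activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')

        # "a" appends instead of truncating, so earlier lines survive each reopen
        with open("your file path here", "a") as file2:
            file2.write(itemLista + " " + str(isActivated.text) + " " + str(activationDate.text) + "\n")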
1

It is not clear from your question what the list items within lista are, so the actual URL/website can't be checked.

However, it may not be possible to reduce RAM consumption while accessing the website more than 10k times in a row with the approach you have adopted.

Solution

As you mentioned that when the script accesses this site around 2,500 times it already consumes 4 GB or more of RAM and stops working, you may introduce a counter and, after every 2,000 accesses within the loop, reinitialize the WebDriver and web browser afresh, invoking driver.quit() within the tearDown() method to close and destroy the existing WebDriver and Web Client instances gracefully, as follows:

driver.quit()  # Python
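A minimal sketch of this recycle pattern (the create_driver() factory is hypothetical; as discussed in the comments, the OP's manual two-factor login would have to be repeated after every restart):

    from selenium import webdriver

    RECYCLE_EVERY = 2000  # restart threshold; tune as needed

    def create_driver():
        # hypothetical factory: starts a fresh Firefox session; any manual
        # two-factor login has to be redone at this point
        return webdriver.Firefox()

    driver = create_driver()
    for i, itemLista in enumerate(lista, start=1):
        driver.get("https://mytest.site.com/query/option?opt=" + str(itemLista))
        # ... scrape and write as in the question ...
        if i % RECYCLE_EVERY == 0:
            driver.quit()             # destroy WebDriver and browser gracefully
            driver = create_driver()  # fresh instance, baseline memory footprint
    driver.quit()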

You can find a detailed discussion in PhantomJS web driver stays in memory

In case the GeckoDriver and Firefox processes are still not destroyed and removed, you may need to kill those processes from the task list.

  • Python Solution (Cross Platform):

    import os
    import psutil
    
    PROCNAME = "geckodriver" # or chromedriver or iedriverserver
    for proc in psutil.process_iter():
        # check whether the process name matches
        if proc.name() == PROCNAME:
            proc.kill()
    

You can find a detailed discussion in Selenium : How to stop geckodriver process impacting PC memory, without calling driver.quit()?

undetected Selenium
  • Unfortunately I can't use driver.quit() because it will destroy the web session. I need the web session because when I run the script I have to manually input a two-factor password. If I set the script to use driver.quit() after 2000 iterations, I will need to enter the password again when I restart the driver. But just like you said, I think there is no other solution to this problem. – fabiobh Jan 04 '19 at 14:51
  • In that case you can invoke `driver.close()` and forcefully kill the **Firefox** browser instances. – undetected Selenium Jan 04 '19 at 14:53