I am using the Selenium Firefox webdriver to do some scraping, like below:
```python
import time
from selenium import webdriver

def scrape(url):
    options = webdriver.FirefoxOptions()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(url)
    # do complex reading, element clicking, and moving to and from pages
    browser.quit()
    time.sleep(5)
```
The above works well when it's just one process (no parallel workers): Firefox's memory consumption is stable and is cleared periodically after loading a lot of data during scraping.
However, once I run the function in parallel with joblib's `Parallel`, there seems to be a memory leak:

```python
Parallel(n_jobs=-1)(delayed(scrape)(link) for link in links)
```
Adding a `time.sleep()` call after `browser.quit()` seems to help slightly, but not by much. I also noticed that the fewer parallel jobs there are, the less severe the leak is.
I have added more `time.sleep()` calls throughout the code, in case the code is too computationally intensive and `browser.quit()` is not getting a chance to release memory. But I still have the problem compared to running a single job.
Ultimately I want all Firefox memory to be released before a new job starts another webdriver session. That does not seem to be happening. Why is that, and how can it be fixed?