I am scraping forum post titles using Selenium with the Firefox geckodriver in Python and have hit a snag that I can't figure out.
~$ geckodriver --version
geckodriver 0.19.0
The source code of this program is available from
testing/geckodriver in https://hg.mozilla.org/mozilla-central.
This program is subject to the terms of the Mozilla Public License 2.0.
You can obtain a copy of the license at https://mozilla.org/MPL/2.0/.
I am trying to scrape a couple of years' worth of past post titles from the forum, and my code works fine for a while. I've sat and watched it run for about 20-30 minutes, and it does exactly what it is supposed to do. However, when I kick the script off and go to bed, I wake up the next morning to find that it processed ~22,000 posts and then crashed. The site I'm currently scraping has 25 posts per page, so it got through ~880 separate URLs before crashing.
When it does crash, it throws the following error:
WebDriverException: Message: Tried to run command without establishing a connection
Initially my code looked like this:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

FirefoxProfile = webdriver.FirefoxProfile('/home/me/jupyter-notebooks/FirefoxProfile/')
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()
I've also tried:
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()
and
for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()
I get the same error in all 3 scenarios, but only after it's been running successfully for quite a while, and I'm not sure how to determine why it's failing.
How do I determine why I get this error after it has successfully processed several hundred URLs? Or is there some best practice I'm not following with Selenium/Firefox for processing this many pages?
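To make the question concrete, here is a minimal sketch of the kind of diagnostic wrapper that could show where the run dies. It is not a confirmed fix. It logs the index and URL of every failure and replaces the driver before retrying. The names `scrape_with_restart`, `make_driver`, `process`, `retry_on` and `max_retries` are hypothetical and not part of Selenium. With Selenium, `make_driver` would presumably wrap the `webdriver.Firefox(...)` call above and `retry_on` would be `selenium.common.exceptions.WebDriverException`. The sketch calls `driver.quit()` rather than `driver.close()`, because `close()` only closes the current window while `quit()` ends the whole browser session.

```python
import logging

log = logging.getLogger("scraper")


def _safe_quit(driver):
    """Quit the driver, ignoring errors from a session that is already dead."""
    try:
        driver.quit()
    except Exception:
        pass


def scrape_with_restart(urls, make_driver, process,
                        retry_on=(Exception,), max_retries=2):
    """Visit each URL, replacing the driver whenever a page fails.

    make_driver() and process(driver, url) are hypothetical callables
    supplied by the caller. Returns the URLs that still failed after
    max_retries restarts.
    """
    failed = []
    driver = make_driver()
    try:
        for index, url in enumerate(urls):
            for attempt in range(max_retries + 1):
                try:
                    driver.get(url)
                    process(driver, url)
                    break  # page handled, move on to the next URL
                except retry_on as exc:
                    # Log where the run broke so failures can be compared
                    # across nights (same URL? same page count?).
                    log.warning("URL #%d (%s) failed on attempt %d: %r",
                                index, url, attempt + 1, exc)
                    _safe_quit(driver)
                    driver = make_driver()  # start from a fresh session
            else:
                # Every attempt raised, so record the URL and keep going.
                failed.append(url)
    finally:
        _safe_quit(driver)
    return failed
```

If the logged index is roughly the same on every overnight run, that would point toward something accumulating in the browser or driver process. If it varies widely, a dropped connection or a problem with particular pages seems more likely.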