I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.
I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:

    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.support.ui import WebDriverWait

    def readystate_complete(d):
        # AFAICT Selenium offers no better way to wait for the document to be loaded,
        # if one is in ignorance of its contents.
        return d.execute_script("return document.readyState") == "complete"

    def load_page(driver, url):
        try:
            driver.get(url)
            WebDriverWait(driver, 30).until(readystate_complete)
        except WebDriverException:
            # Includes TimeoutException; carry on and extract whatever is there.
            pass
        links = []
        try:
            for elt in driver.find_elements_by_xpath("//a[@href]"):
                try: links.append(elt.get_attribute("href"))
                except WebDriverException: pass
        except WebDriverException: pass
        return links

This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning), but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.
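For reference, my mental model of what .until does is roughly the loop below. This is only a sketch of a generic poll-with-deadline helper, not Selenium's actual source; poll_until, PollTimeout, and the interval parameter are names I made up for illustration. If the model is right, the deadline is only examined between calls to the predicate, so a single predicate call that never returns would never let the timeout fire. I have not confirmed that this is what is happening in my case.

```python
import time


class PollTimeout(Exception):
    """Raised when the predicate is still falsy after the deadline (hypothetical name)."""


def poll_until(predicate, timeout, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    The deadline is checked only after each predicate call returns,
    so a predicate that blocks forever is never interrupted.
    """
    deadline = time.time() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.time() > deadline:
            raise PollTimeout("condition not met within %s seconds" % timeout)
        time.sleep(interval)
```

In my script the predicate would be the equivalent of lambda: readystate_complete(driver), with a timeout of 30.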
What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?
Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that the script extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).
NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.