
I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.

I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:

from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try:
                links.append(elt.get_attribute("href"))
            except WebDriverException:
                pass
    except WebDriverException:
        pass
    return links

This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning) but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.

What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?

Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).

NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.

zwol
  • If you're using `WebDriverWait`, you're using Selenium 2, not Selenium RC. – Ross Patterson Sep 11 '13 at 08:39
  • @RossPatterson I was under the impression Selenium 2 and Selenium RC were the same thing, whereas Selenium IDE was the old QuicKeys-style thingy. Thanks for correction. – zwol Sep 11 '13 at 14:41
  • What did you do in the end? – KnewB Dec 05 '13 at 18:55
  • @KnewB I gave up. My code now sets a global one-minute timeout and then does `driver.get(url)` followed immediately by `driver.find_elements_by_xpath("//a[@href]")`. This does seem to wait for the page to load before reporting links. It still hangs forever from time to time, so I also wrote a watchdog process that kills and restarts the entire browser if it doesn't report any progress in five minutes. It triggers often enough to be a headache, but it is not worth my time to try to debug it any further. I still hope someone with more clue will come along here. – zwol Dec 05 '13 at 19:15
  • You can use the pageLoadTimeout() method. It takes the maximum time the browser has to wait for the page to load. If the page loads before the max time, the script continues executing. If the page does not load within the max time, you can catch an exception and close the browser. Hope this helps you. – Vinay Jan 14 '14 at 05:11
  • @Vinay That's what I am using for the "global one-minute timeout" I mentioned above. The current code is more reliable than what I originally posted but still hangs forever about one page load out of N (where N is somewhere between 20 and 100). – zwol Jan 14 '14 at 05:17
  • @Zack sorry am more easy with java. If it hangs at that point then can you catch the exception and solve it. – Vinay Jan 14 '14 at 05:22
  • @Vinay When the hang happens, no exception is thrown. – zwol Jan 14 '14 at 14:09
  • Have you checked to see if d.execute_script(...) is returning? Have you tried d.execute_async_script (with a reasonable timeout set)? – Talia Jul 21 '14 at 13:23
  • @Collin Because of this problem and several others, I have completely given up on Selenium, so no, I have not checked these things. I would be inclined to view `d.execute_script("return document.readyState")` failing to return as a critical-severity bug in Selenium and/or the browser itself. – zwol Jul 21 '14 at 15:27
  • I agree. Did you find a better automation tool? I'm having a lot of the same problems with selenium. – Talia Jul 24 '14 at 13:35
  • @Collin I am attempting to write my own: https://github.com/zackw/firefox-puppeteer I don't even know if it works for *me* yet (got recursively sidetracked) but improvements definitely welcome. – zwol Jul 24 '14 at 14:07

5 Answers


I had a similar situation: I wrote the screenshot system for a fairly well-known website service and faced the same predicament: I could not know anything about the page being loaded.

After speaking with some of the Selenium developers, the answer was that various WebDriver implementations (Firefox Driver versus IEDriver for example) make different choices about when a page is considered to be loaded or not for the WebDriver to return control.

If you dig deep into the Selenium code, you can find the spots that try to make the best choice. But a number of things can cause the state being watched for to fail, such as multiple frames where one does not complete in a timely manner, so there are cases where the driver simply never returns.

I was told, "it's an open-source project", and that it probably won't/can't be corrected for every possible scenario, but that I could make fixes and submit patches where applicable.

In the long run, that was a bit much for me to take on, so, like you, I created my own timeout process. Since I use Java, I created a new Thread that, upon reaching the timeout, tries several things to get WebDriver to return; at times just sending certain keystrokes to the browser has worked. If it still does not return, I kill the browser and try again.

Starting the driver again has handled most cases for us, as if the second load of the browser allowed it to be in a more settled state (mind you we are launching from VMs and the browser constantly wants to check for updates and run certain routines when it hasn't been launched recently).

Another piece of this is that we launch a known URL first and confirm some aspects of the browser, and that we are in fact able to interact with it, before continuing. With these steps together, the failure rate is pretty low: about 3% across thousands of tests on all browsers/versions/OSes (FF, IE, Chrome, Safari, Opera, iOS, Android, etc.).

Last but not least, for your case, it sounds like you only really need to capture the links on the page, not have full browser automation. There are other approaches I might take toward that, namely cURL and Linux tools.
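A minimal Python sketch of this hard-timeout-plus-restart idea (the `run_with_timeout` helper and its `on_timeout` hook are hypothetical names, not Selenium API; in real use `on_timeout` would kill and relaunch the browser):

```python
import threading

def run_with_timeout(fn, timeout, on_timeout):
    """Run fn() in a worker thread.  If it has not finished within
    `timeout` seconds, call on_timeout() (e.g. driver.quit() followed
    by relaunching the browser) and report failure.

    Returns (finished, result); result is None if fn timed out or raised."""
    result = {}

    def worker():
        result['value'] = fn()

    t = threading.Thread(target=worker)
    t.daemon = True   # a hung page load must not keep the process alive
    t.start()
    t.join(timeout)
    if t.is_alive():
        on_timeout()  # last resort: tear down the browser
        return False, None
    return True, result.get('value')
```

For example, `ok, links = run_with_timeout(lambda: load_page(driver, url), 300, restart_browser)`, where `restart_browser` is whatever tears down and relaunches the driver.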

Lukus
  • This is interesting. Just for the record, though, I am actually recording rather more than just the links and I need to use a setup that mimics "real browsing" as closely as possible in terms of network behavior, hence the use of FFDriver. – zwol Jan 14 '14 at 05:15
  1. The "recommended" (though still ugly) solution is to use an explicit wait:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait 
    from selenium.webdriver.support import expected_conditions
    
    old_value = browser.find_element_by_id('thing-on-old-page').text
    browser.find_element_by_link_text('my link').click()
    WebDriverWait(browser, 3).until(
        expected_conditions.text_to_be_present_in_element(
            (By.ID, 'thing-on-new-page'),
            'expected new text'
        )
    )
    
  2. The naive attempt would be something like this:

    def wait_for(condition_function):
        start_time = time.time()
        while time.time() < start_time + 3:
            if condition_function():
                return True
            else:
                time.sleep(0.1)
        raise Exception(
            'Timeout waiting for {}'.format(condition_function.__name__)
        )
    
    
    def click_through_to_new_page(link_text):
        browser.find_element_by_link_text('my link').click()
    
        def page_has_loaded():
            page_state = browser.execute_script(
                'return document.readyState;'
            ) 
            return page_state == 'complete'
    
        wait_for(page_has_loaded)
    
  3. Another, better one would be (credits to @ThomasMarks):

    def click_through_to_new_page(link_text):
        link = browser.find_element_by_link_text('my link')
        link.click()
    
        def link_has_gone_stale():
            try:
                # poll the link with an arbitrary call
                link.find_elements_by_id('doesnt-matter') 
                return False
            except StaleElementReferenceException:
                return True
    
        wait_for(link_has_gone_stale)
    
  4. And the final example includes comparing page ids as below (which could be bulletproof):

    class wait_for_page_load(object):
    
        def __init__(self, browser):
            self.browser = browser
    
        def __enter__(self):
            self.old_page = self.browser.find_element_by_tag_name('html')
    
        def page_has_loaded(self):
            new_page = self.browser.find_element_by_tag_name('html')
            return new_page.id != self.old_page.id
    
        def __exit__(self, *_):
            wait_for(self.page_has_loaded)
    

    And now we can do:

    with wait_for_page_load(browser):
        browser.find_element_by_link_text('my link').click()
    

The code samples above are from Harry's blog.
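One refinement, not from Harry's blog: samples 3 and 4 fire as soon as the old page is torn down, which can be well before the new one has loaded. A way around that is to wait for several conditions in sequence, e.g. staleness of the old page first, then `document.readyState == 'complete'` in the new one. A hypothetical combinator for that:

```python
import time

def wait_for_each(conditions, timeout=30, poll=0.1):
    """Wait for each zero-argument condition in order, all sharing one
    overall deadline.  Raises RuntimeError if the deadline expires
    before every condition has returned truthy."""
    deadline = time.time() + timeout
    for cond in conditions:
        while not cond():
            if time.time() > deadline:
                raise RuntimeError('timed out waiting for %r' % cond)
            time.sleep(poll)
```

Used as `wait_for_each([link_has_gone_stale, page_has_loaded], timeout=30)` with the predicates from samples 2 and 3.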

kenorb
  • Unfortunately none of these are good enough for my use case: (1) is a non-starter because I do not know any 'thing-on-new-page', this has to work for *arbitrary* pages with contents unknown. (2) has the same problem as the code in my original question (works a lot of the time but sometimes hangs forever). (3, 4) will trigger _well before_ the point at which the page has actually loaded. – zwol May 25 '15 at 20:52
  • Thanks for trying, though! As I mentioned elsewhere, I gave up altogether on Selenium because this was just so intractable. – zwol May 25 '15 at 20:54

As far as I know, your readystate_complete check isn't doing anything, because driver.get() already waits for that condition. In any case, I have seen it fail to work in many cases. One thing you could try is to route your traffic through a proxy and use it to watch for network traffic. For example, BrowserMob Proxy has a wait_for_traffic_to_stop method:

import requests

def wait_for_traffic_to_stop(self, quiet_period, timeout):
    """
    Waits for the network to be quiet
    :Args:
     - quiet_period - number of milliseconds the network needs to be quiet for
     - timeout - max number of milliseconds to wait
    """
    r = requests.put('%s/proxy/%s/wait' % (self.host, self.port),
                     {'quietPeriodInMs': quiet_period, 'timeoutInMs': timeout})
    return r.status_code
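For illustration, the quiet-period semantics that method relies on can be sketched in plain Python (a hypothetical model of the logic, not the BrowserMob client itself):

```python
import time

def wait_until_quiet(last_activity, quiet_period, timeout, poll=0.1):
    """Return True once last_activity() (a timestamp of the most recent
    network event) is at least quiet_period seconds in the past; return
    False if that never happens within timeout seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if time.time() - last_activity() >= quiet_period:
            return True
        time.sleep(poll)
    return False
```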
Erki M.
  • You're quite right about the `readystate_complete` bit being unhelpful; as I mention above it started working somewhat better when I took that out. There is already a proxy for other reasons, so I will think about your suggestion. – zwol Jan 14 '14 at 05:12

Here is a solution proposed by Tommy Beadle (using the staleness approach):

import contextlib
from selenium.webdriver import Remote
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MyRemote(Remote):
    @contextlib.contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.find_element_by_tag_name('html')
        yield
        WebDriverWait(self, timeout).until(staleness_of(old_page))
kenorb
  • ... and this has the same problem as your (3, 4) from your other answer: firing much too _early_, shortly after the old page is destroyed, which can happen before the browser has even finished parsing the HTML. – zwol May 25 '15 at 20:53

If the page is still loading indefinitely, I'm guessing the readyState never reaches "complete". If you're using Firefox, you can force the page loading to halt by calling window.stop():

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

try:
    driver.get(url)
    WebDriverWait(driver, 30).until(readystate_complete)
except TimeoutException:
    driver.execute_script("window.stop();")
Joe Coder
  • FYI, this is one of the suggestions from other variations on this question, that I mention I tried and it didn't work. Specifically, it does not prevent the hanging-forever phenomenon, although it *may* have made it less frequent. – zwol Dec 05 '13 at 19:18