
I am trying to scrape data from a website using Selenium and PhantomJS in Python. However, this website adds the data I'm interested in via JavaScript. Is there a way to ask Selenium to wait for the data before returning it? So far, we've tried:

import contextlib
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui

phantomjs = '/usr/local/bin/phantomjs'
url = '[redacted]'

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    # Block for up to ten seconds, until the div's text starts with the expected value...
    wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
    # ...then read the text out.
    text = driver.execute_script("return document.getElementById('myID').innerText")

Unfortunately, this code raises `selenium.common.exceptions.TimeoutException: Message: None`, because the content of the element we're polling never changes to the expected value.
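A variant we could try: wait for the element to be present and for its text to become non-empty, rather than asserting a prefix that may never match. A minimal sketch using selenium's `expected_conditions` helpers (untested against the real page; `myID` is the placeholder id from above):

from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    # First wait for the div to exist at all, then for it to contain any text.
    element = wait.until(EC.presence_of_element_located((By.ID, 'myID')))
    wait.until(lambda driver: element.text.strip() != '')
    print element.text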

We are using PhantomJS 1.9.7, Python 2.7.5 in a virtualenv, and selenium 2.41.0. Is this the right way to do it, or are we missing something? Does anyone have a better method?

Thanks in advance.

EDIT

Following @ExperimentsWithCode's comment, we tried looping until the content is loaded:

from selenium.common.exceptions import TimeoutException

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    found = False
    while not found:
        try:
            wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
            text = driver.execute_script("return document.getElementById('myID').innerText")
            found = True
        except TimeoutException:
            # Each wait gives the page ten more seconds; keep retrying.
            print "Not found"
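To see what PhantomJS actually renders when the wait times out, we could dump part of the page source from the exception handler. A small diagnostic sketch, reusing `driver`, `wait`, and the `TimeoutException` import from above:

try:
    wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
except TimeoutException:
    # If 'myID' is missing or empty in this dump, the javascript that
    # fills it probably never ran under PhantomJS.
    print driver.page_source[:2000]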
    Assuming this content is loading without your input, you can do a loop with a try statement. That way you can try to get that text. If the text is not loaded it will try again, and again, until the text is loaded. – ExperimentsWithCode Apr 28 '14 at 19:05
  • The text never seems to load, using the code in the edit. –  Apr 29 '14 at 08:17
  • There must be something that triggers the javascript, or something preventing it from triggering. Can you describe the element you are trying to interact with? Also try two things: when you run this code, can you click or interact with something that triggers the element to appear? And what happens if that try loop is commented out: does the element just load in this case, or is it triggered by some interaction? – ExperimentsWithCode Apr 29 '14 at 15:12
  • The element I need to get is simply some text inside a div. According to the browser console, the element appears by itself after some request when the site is accessed from a regular browser. If the `try` is commented out, the code throws an exception. –  Apr 29 '14 at 15:42
  • Any info on what that request is? Also, the test was more to see if it appears without the try statement there. What error were you getting? Was it later in the code? If so, can you just comment out the rest of the code so the browser hangs and see if the element appears. Did you try clicking anything to see if that initiated the javascript? – ExperimentsWithCode Apr 29 '14 at 15:48
  • Also, this might be helpful. http://stackoverflow.com/questions/11018796/clicking-on-a-javascript-link-on-firefox-with-selenium – ExperimentsWithCode Apr 29 '14 at 16:19
  • The request is to a Twitter API endpoint. The error we are getting is the `selenium.common.exceptions.TimeoutException`, because `wait.until` fails. I can't check if the browser hangs because we are using PhantomJS instead of a regular browser. –  Apr 29 '14 at 16:26
  • Can you load the site manually and try to figure out what is triggering the API call? That is essentially the puzzle that needs solving at this point. – ExperimentsWithCode Apr 30 '14 at 14:37
  • The API call is triggered via javascript. Unfortunately, this call is somewhere in a vast codebase and we haven't been able to find it so far. This is why we are trying to scrape the website instead. –  Apr 30 '14 at 15:35
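
EDIT 2

Following the comment thread: instead of hunting for the trigger in the javascript, we may be able to ask GhostDriver which network requests the page made and spot the API call that way. This is a sketch under the assumption that our PhantomJS/GhostDriver combination exposes a 'har' log type whose entries carry JSON-encoded HAR fragments; we haven't confirmed this for these exact versions:

import json

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    # Each log entry's 'message' field should be a JSON HAR document
    # listing the requests the page made while loading.
    for entry in driver.get_log('har'):
        har = json.loads(entry['message'])
        for request_entry in har.get('log', {}).get('entries', []):
            print request_entry['request']['url']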

0 Answers