
I am trying to scrape data from a website using Selenium and PhantomJS in Python. However, this website adds the data I'm interested in via JavaScript. Is there a way to ask Selenium to wait for the data before returning it? So far, we've tried:

import contextlib
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui

phantomjs = '/usr/local/bin/phantomjs'
url = '[redacted]'

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    # Block for up to ten seconds, until the div's text starts with the expected value...
    wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
    # ...then read the text out.
    text = driver.execute_script("return document.getElementById('myID').innerText")

Unfortunately, this code raises `selenium.common.exceptions.TimeoutException: Message: None`, because the content of the element we're polling never changes to the expected value.
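A variant we could try: wait for the element to be present and for its text to become non-empty, rather than asserting a prefix that may never match. A minimal sketch using selenium's `expected_conditions` helpers (untested against the real page; `myID` is the placeholder id from above):

from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    # First wait for the div to exist at all, then for it to contain any text.
    element = wait.until(EC.presence_of_element_located((By.ID, 'myID')))
    wait.until(lambda driver: element.text.strip() != '')
    print element.text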

We are using PhantomJS 1.9.7, Python 2.7.5 in a virtualenv, and selenium 2.41.0. Is this the right way to do it, or are we missing something? Does anyone have a better method?

Thanks in advance.

EDIT

Following @ExperimentsWithCode's comment, we tried looping until the content is loaded:

from selenium.common.exceptions import TimeoutException

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    wait = ui.WebDriverWait(driver, 10)
    found = False
    while not found:
        try:
            wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
            text = driver.execute_script("return document.getElementById('myID').innerText")
            found = True
        except TimeoutException:
            # Each wait gives the page ten more seconds; keep retrying.
            print "Not found"
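To see what PhantomJS actually renders when the wait times out, we could dump part of the page source from the exception handler. A small diagnostic sketch, reusing `driver`, `wait`, and the `TimeoutException` import from above:

try:
    wait.until(lambda driver: driver.execute_script("return document.getElementById('myID').innerText").startswith('[redacted]'))
except TimeoutException:
    # If 'myID' is missing or empty in this dump, the javascript that
    # fills it probably never ran under PhantomJS.
    print driver.page_source[:2000]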
    Assuming this content is loading without your input, you can do a loop with a try statement. That way you can try to get that text. If the text is not loaded it will try again, and again, until the text is loaded. – ExperimentsWithCode Apr 28 '14 at 19:05
  • The text never seems to load, using the code in the edit. –  Apr 29 '14 at 08:17
  • There must be something that triggers the javascript, or something preventing it from triggering. Can you describe the element you are trying to interact with? Also try two things: when you run this code, can you click or interact with something that triggers the element to appear? And what happens if that try loop is commented out: does the element just load in this case, or is it triggered by some interaction? – ExperimentsWithCode Apr 29 '14 at 15:12
  • The element I need to get is simply some text inside a div. According to the browser console, the element appears by itself after some request when the site is accessed from a regular browser. If the `try` is commented out, the code throws an exception. –  Apr 29 '14 at 15:42
  • Any info on what that request is? Also, the test was more to see if it appears without the try statement there. What error were you getting? Was it later in the code? If so, can you just comment out the rest of the code so the browser hangs and see if the element appears. Did you try clicking anything to see if that initiated the javascript? – ExperimentsWithCode Apr 29 '14 at 15:48
  • Also, this might be helpful. http://stackoverflow.com/questions/11018796/clicking-on-a-javascript-link-on-firefox-with-selenium – ExperimentsWithCode Apr 29 '14 at 16:19
  • The request is to a Twitter API endpoint. The error we are getting is the `selenium.common.exceptions.TimeoutException`, because `wait.until` fails. I can't check if the browser hangs because we are using PhantomJS instead of a regular browser. –  Apr 29 '14 at 16:26
  • Can you load the site manually and try to figure out what is triggering the API call? That is essentially the puzzle that needs solving at this point. – ExperimentsWithCode Apr 30 '14 at 14:37
  • The API call is triggered via javascript. Unfortunately, this call is somewhere in a vast codebase and we haven't been able to find it so far. This is why we are trying to scrape the website instead. –  Apr 30 '14 at 15:35
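
EDIT 2

Following the comment thread: instead of hunting for the trigger in the javascript, we may be able to ask GhostDriver which network requests the page made and spot the API call that way. This is a sketch under the assumption that our PhantomJS/GhostDriver combination exposes a 'har' log type whose entries carry JSON-encoded HAR fragments; we haven't confirmed this for these exact versions:

import json

with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    # Each log entry's 'message' field should be a JSON HAR document
    # listing the requests the page made while loading.
    for entry in driver.get_log('har'):
        har = json.loads(entry['message'])
        for request_entry in har.get('log', {}).get('entries', []):
            print request_entry['request']['url']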

0 Answers