I'm using Selenium through Python on an AWS Ubuntu server to scrape dynamic pages that rely on JavaScript (I need the fully rendered HTML). I finally got it working for most websites (thanks to the question "unable to call firefox from selenium in python on AWS machine"), except for some that consistently give me a "Problem loading page" error.

In IPython:

from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(1024, 768))
display.start()
driver = webdriver.Firefox()
actions = webdriver.ActionChains(driver)

The following work fine (and respond quickly):

driver.get('http://www.apple.com/')
print driver.title
>> Apple

driver.get('http://www.orange.com/')
print driver.title
>> Orange.com: Corporate Website of Orange

But the following hangs for 2-3 minutes and then finally returns with a "Problem loading page" title:

driver.get('http://www.trivago.com/')
print driver.title
>> Problem loading page

Here's some more info on the attributes of the driver at that point, in case it helps:

{'_is_remote': False,
 'binary': <selenium.webdriver.firefox.firefox_binary.FirefoxBinary at 0x296d590>,
 'capabilities': {u'acceptSslCerts': True,
  u'applicationCacheEnabled': True,
  u'browserConnectionEnabled': True,
  u'browserName': u'firefox',
  u'cssSelectorsEnabled': True,
  u'databaseEnabled': True,
  u'handlesAlerts': True,
  u'javascriptEnabled': True,
  u'locationContextEnabled': True,
  u'nativeEvents': True,
  u'platform': u'Linux',
  u'rotatable': False,
  u'takesScreenshot': True,
  u'version': u'26.0',
  u'webStorageEnabled': True},
 'command_executor': <selenium.webdriver.firefox.extension_connection.ExtensionConnection at 0x296d6d0>,
 'error_handler': <selenium.webdriver.remote.errorhandler.ErrorHandler at 0x7f14f4d4bf50>,
 'profile': <selenium.webdriver.firefox.firefox_profile.FirefoxProfile at 0x2418cd0>,
 'session_id': u'ece53830-2b9d-4a32-b692-777602190d0c'}

The same URLs all load fine when I run the same code locally (from the terminal on my Mac).
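
Since the code behaves differently on the two machines, one way to separate a network problem from a Selenium problem is to test plain TCP reachability from the server, independent of the browser. The following is only a minimal sketch: `can_connect` is a hypothetical helper (not part of Selenium), and the default port 80 and 5-second timeout are assumptions.

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for diagnosis only; it does not fetch or render the page.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # DNS failure, refused connection, or timeout all count as unreachable.
        return False
    conn.close()
    return True
```

Calling `can_connect('www.trivago.com')` on the server and on the local machine should show whether the failing site is reachable at all before Firefox is involved.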

thorwhalen
  • Maybe http://www.trivago.com/ is inaccessible from your server? Try fetching it with curl from the server shell. – Dmitry Vakhrushev Jan 19 '14 at 18:58
  • I think you need to add the following property in your Display settings: `lmportal.xvfb.id=":99"` or `lmportal.xvfb.id=99` (though, it would not explain why you've managed to open one URL and failed to open another). – barak manos Jan 19 '14 at 19:05
  • @DmitryVakhrushev: Quick, short, and... spot on! Thanks a bunch. Curl failed indeed: `~$ curl www.trivago.com` returned `curl: (7) Failed connect to www.trivago.com:80; Connection timed out`. But now the question is: why, and what to do about it? – thorwhalen Jan 20 '14 at 13:36
  • @DmitryVakhrushev: Sorry Dmitry, I'm trying to vote you up, but it seems I'm too new to Stack Overflow to do so. @barakmanos: Thanks for the tip. I'll keep that in mind if I have display problems. – thorwhalen Jan 20 '14 at 13:45
  • Solved: I managed to access the problematic URLs through a different proxy, with `curl URL --proxy IP:PORT`. Thanks again @DmitryVakhrushev for the initial tip. – thorwhalen Jan 22 '14 at 12:50
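
The last comment only shows the proxy working with curl. To route the Selenium-driven Firefox through the same proxy, one possible approach is to set Firefox's manual proxy preferences on the profile before starting the driver. This is a sketch under assumptions, not the asker's confirmed fix: `firefox_proxy_prefs` and `apply_prefs` are hypothetical helpers, and the usage shown in the comments assumes the Selenium 2.x `FirefoxProfile` API that matches the question's setup.

```python
def firefox_proxy_prefs(proxy):
    """Build Firefox manual-proxy preferences from an 'IP:PORT' string."""
    host, sep, port = proxy.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected 'IP:PORT', got {0!r}".format(proxy))
    port = int(port)
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": port,
        "network.proxy.ssl": host,  # also use the proxy for HTTPS traffic
        "network.proxy.ssl_port": port,
    }


def apply_prefs(profile, prefs):
    """Copy each preference onto an object exposing set_preference(key, value)."""
    for key, value in sorted(prefs.items()):
        profile.set_preference(key, value)
    return profile


# Usage with Selenium 2.x (assumed; requires selenium and Firefox installed),
# where proxy_address holds the real proxy as an 'IP:PORT' string:
#   from selenium import webdriver
#   profile = apply_prefs(webdriver.FirefoxProfile(),
#                         firefox_proxy_prefs(proxy_address))
#   driver = webdriver.Firefox(firefox_profile=profile)
```

Keeping the preference-building step separate from the driver makes it possible to check the values without launching a browser.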

0 Answers