
When this page is scraped with urllib2:

url = "https://www.geckoboard.com/careers/"
response = urllib2.urlopen(url)
content = response.read()

the following element (the link to the job) is nowhere to be found in the source (content):

[screenshot: the FRONT-END ENGINEER job link as shown in the browser's element inspector]

Taking a look at the full source that gets rendered in a browser:

[screenshot: the fully rendered page, where the job link is present]

So it would appear that the FRONT-END ENGINEER element is dynamically loaded by JavaScript. Is it possible to have this JavaScript executed by urllib2 (or another low-level library), without involving e.g. Selenium, BeautifulSoup, or similar tools?

Pyderman
    No, unless you can hunt around enough in the page JS to figure out where that link is coming from. – xrisk Feb 09 '16 at 16:18

3 Answers


The pieces of information are loaded via an AJAX request. You can use the Firebug extension for Firefox, or Chrome's built-in developer tools, to inspect these requests. Just hit F12 in Chrome while opening the URL and you can find the complete details there.

There you will find a request with the URL https://app.recruiterbox.com/widget/13587/openings/

The information from that URL is what gets rendered into the web page.
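Once you've found the AJAX endpoint, you can fetch and parse its JSON directly, no browser needed. A minimal sketch (in Python 3 syntax; the JSON field names "objects", "title", and "hosted_url" are assumptions for illustration, so check the actual response in the developer tools first):

```python
import json

# A stand-in for what the widget endpoint might return -- the real
# response shape must be confirmed in the browser's network tab.
sample_response = json.dumps({
    "objects": [
        {"title": "Front-end Engineer",
         "hosted_url": "https://geckoboard.recruiterbox.com/jobs/12345"},
    ]
})

def extract_openings(raw_json):
    """Pull (title, url) pairs out of the widget's JSON payload."""
    data = json.loads(raw_json)
    return [(job["title"], job["hosted_url"]) for job in data.get("objects", [])]

# In practice you'd fetch the JSON first, e.g. with urllib2:
#   raw_json = urllib2.urlopen("https://app.recruiterbox.com/widget/13587/openings/").read()
for title, url in extract_openings(sample_response):
    print(title, url)
```

This avoids executing any JavaScript at all: you request the same data the page's script requests.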

Jithin

From what I understand, you are building something generic for multiple websites and don't want to dig into how a particular site is loaded or what requests are made under the hood to construct the page. In this case, a real browser is your friend: load the page in a real browser automated via selenium, and once the page is loaded, pass the .page_source to lxml.html (which, from what I see, is your HTML parser of choice) for further parsing.

If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.

Here is a sample code to get you started:

from lxml.html import fromstring
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")

# TODO: you might need a delay here

tree = fromstring(driver.page_source)

driver.quit()  # quit() rather than close(), so the PhantomJS process exits

# TODO: parse HTML

You should also know that there are plenty of ways to locate elements in selenium, so you might not even need a separate HTML parser here.

alecxe

I think you're looking for something like this: https://github.com/scrapinghub/splash
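Splash runs as an HTTP service that renders JavaScript for you: you point its render.html endpoint at a page and get back the rendered HTML, which you can then fetch with urllib2 like any static page. A sketch of building such a request (Python 3 syntax; the localhost:8050 address is Splash's default and the wait value is an assumption to tune for your page):

```python
from urllib.parse import urlencode

# Build a request to a locally running Splash instance (default port 8050).
params = urlencode({
    "url": "https://www.geckoboard.com/careers/",
    "wait": 2,  # seconds to let the page's JavaScript run before rendering
})
render_url = "http://localhost:8050/render.html?" + params

# Then fetch the rendered HTML as if it were a static page, e.g.:
#   content = urllib.request.urlopen(render_url).read()
print(render_url)
```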

Samuel Ellis
  • 157
  • 2
  • 5