import urllib2
urllib2.urlopen("http://www.someURL.com/pageTracker.html").read()

The code above returns the source HTML of the page at that URL.

What do I need to do to actually return the rendered HTML that you see when you visit a page such as google.com? I'm essentially trying to 'execute' a URL to trigger a view, not retrieve the HTML.

To clarify a few things:

  • I'm not actually concerned about the visual output of the page
  • I'm concerned about the page rendering as it would inside of a proper browser so that I can track a Google Analytics goal via the JavaScript on that page.
Ryan Martin
    You might need to put this HTML in a rendering library http://stackoverflow.com/questions/126131/python-library-for-rendering-html-and-javascript – Vincent Audebert Dec 16 '13 at 23:15

2 Answers


Because the Google home page relies in part on JavaScript, you cannot get the rendered HTML with a simple HTTP request or HTML parsing library, since these do not run the JavaScript on the page. Only web browsers render HTML, so you need a browser to get the rendered HTML.
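To illustrate the point, here is a minimal sketch (Python 3, where `urllib2` became `urllib.request`) that serves a small page from a local test server and fetches it with a plain HTTP request. The names `PAGE`, `Handler` and `fetch_raw_source` are made up for this example, and the local server only stands in for a real site. The response contains the `<script>` element as plain text; nothing in it has been executed.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical page whose script would only run inside a browser.
PAGE = b"<html><body><script>document.title = 'tracked';</script></body></html>"


class Handler(BaseHTTPRequestHandler):
    """Serve PAGE for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_raw_source():
    """Fetch PAGE over HTTP and return the bytes exactly as the server sent them."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: let the OS pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # ignore any proxy settings for localhost
        with opener.open(url) as response:
            return response.read()
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_raw_source()` returns the bytes in `PAGE` unchanged, script tag included, which is why a plain request never triggers JavaScript-based tracking.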

Instead of a simple HTTP request library, you need to use a full-blown headless web browser library.

One available option is Selenium and its WebDriver.

https://pypi.python.org/pypi/selenium

  1. Open a page in Selenium. See PyPI for the example.

  2. Wait some time with time.sleep() to make sure all resources are loaded and JavaScript-based DOM modifications settle. The delay depends on the web page, so I suggest you experiment with different values.

  3. You can issue a JavaScript command to the Selenium driver to return the DOM tree of the currently loaded page:

    driver.execute_script("return document.documentElement.outerHTML")
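
The three steps can be put together roughly as follows. This is a sketch, not a definitive implementation: `rendered_html`, `main` and the `settle_seconds` default are names and values chosen for illustration, and `main()` assumes the `selenium` package plus Firefox with its driver are installed. The script reads `document.documentElement.outerHTML`, the serialized root element of the page.

```python
import time


def rendered_html(driver, url, settle_seconds=2.0):
    """Load url in the given WebDriver, wait, and return the current DOM as HTML."""
    driver.get(url)             # step 1: open the page; its JavaScript runs in the browser
    time.sleep(settle_seconds)  # step 2: crude wait for resources and DOM changes to settle
    # step 3: ask the browser to serialize the live DOM
    return driver.execute_script("return document.documentElement.outerHTML")


def main():
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        html = rendered_html(driver, "http://www.someURL.com/pageTracker.html")
        print(len(html))
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

If the goal is only to have the page's JavaScript run, as in the question, loading the page and waiting is enough; the returned HTML can be ignored.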
    
Mikko Ohtamaa
  • Thanks for the feedback. I was hoping for a simpler solution, I'm literally just trying to execute a page view. – Ryan Martin Dec 16 '13 at 23:59
  • The modern Selenium is quite simple to use, so after the initial adoption you might start to like it :) It is also possible to do AJAX-scraping with it. – Mikko Ohtamaa Dec 17 '13 at 08:22

You might want to try https://code.google.com/p/pywebkitgtk/. Using PyWebkit you can create a rendered view of the HTML page.

Rendering a web page is not a trivial task, as web technology is changing constantly. Several rendering engines exist. The most prominent are WebKit (Safari), its fork Blink (Chrome/Chromium, Opera), and Gecko (Firefox). There is also Trident (Internet Explorer).

Google.com also contains JavaScript, which needs to be interpreted. The page should render fine without JavaScript, but it will probably look different.

mxm
  • Thanks for the feedback. I'm actually not concerned about the visual result of the page. I just need the page to execute as it normally would if it's inside of a browser so that the JavaScript on the page can trigger a Google Analytics goal. – Ryan Martin Dec 17 '13 at 00:01