I'm trying to make it easy for users to input numbers from a web page. The easiest thing I can imagine would be for them to provide a url and an xpath associated with that number. My code could then go grab the numbers. The concept of an xpath isn't well-known (to non-coders), but it's trivial to find an xpath using Chrome's Inspect and Developer tools. So that's great.

The problem is that an xpath copied from Chrome or Firefox won't always work in an HTML parser, as explained here: Why does this xpath fail using lxml in python?

Basically, browsers rewrite the source into a more technically correct form (for example, by inserting missing tbody elements into tables), then show this changed form to the user and base their xpaths on it.
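
To make the mismatch concrete, here is a minimal sketch. It uses the standard library's ElementTree as a stand-in for lxml, and the table markup and its id are made up for illustration. The raw source has no tbody, so a browser-style path that includes one finds nothing, while the same path without tbody matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw page source: the table has no <tbody>, as is common in hand-written HTML
RAW_SOURCE = """<html><body>
<table id="prices"><tr><td>42</td></tr></table>
</body></html>"""

root = ET.fromstring(RAW_SOURCE)

# Path as a browser would report it, based on its corrected DOM (tbody inserted)
browser_xpath = ".//*[@id='prices']/tbody/tr/td"
# Path that matches the raw source as the parser actually sees it
source_xpath = ".//*[@id='prices']/tr/td"

browser_hits = root.findall(browser_xpath)
source_hits = root.findall(source_xpath)

print(len(browser_hits))    # no match: the raw source has no tbody
print(source_hits[0].text)  # the cell text from the raw source
```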

This problem could be repaired if there were an automatic way for your code to see not the page source, but Chrome's rendition of the page source. Is there an efficient, automatic way to do this?

One more time, more succinctly and exactly: how would I give python the altered HTML document that Chrome produces rather than the original source document to parse?

Doug Bradshaw

2 Answers

The only way I see is to actually run a web engine...

With QtWebKit QWebFrame you can use setHtml, and toHtml will return the source code adapted by WebKit...

Obviously this is a big dependency, but just installing PySide will get you everything that's needed.


So this turned out to be a lot dirtier than I expected, at least the part that's needed to isolate Qt from other code. Using setHtml doesn't seem to let you use toHtml immediately; some asynchronous loading must happen...

It would probably make a lot more sense to look for some simpler WebKit bindings.

So, load_source both downloads the data from a URL and returns the HTML after modification by WebKit. It wraps Qt's event loop, with its asynchronous events, in a blocking function.

setUrl here can be replaced with setHtml, if you want to do the download separately.

from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings

qapp = QApplication([])

def load_source(url):
    """Load url in WebKit, block until it is ready, and return the HTML as WebKit sees it."""
    page = QWebPage()
    # Skip image downloads; only the markup is needed
    page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
    page.mainFrame().setUrl(QUrl(url))

    class State(QObject):
        src = None
        finished = False

        @Slot()
        def loaded(self, success=True):
            self.finished = True
            # Keep only the first snapshot; DOM ready may fire before the full load
            if self.src is None:
                self.src = page.mainFrame().toHtml()
    state = State()

    # Optional; reacts to DOM ready, which happens before a full load
    def js():
        page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
        page.mainFrame().evaluateJavaScript('''
            document.addEventListener('DOMContentLoaded', qstate$.loaded);
        ''')
    page.mainFrame().javaScriptWindowObjectCleared.connect(js)

    page.mainFrame().loadFinished.connect(state.loaded)

    # Pump Qt's event loop until one of the callbacks above has fired
    while not state.finished:
        qapp.processEvents()

    return state.src

Demonstration using the example from the linked question. Now it actually works...

from lxml import html

url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

src = load_source(url)

tree = html.fromstring(src)
text = tree.xpath(xpath)
Oleh Prypin
  • This worked for my test case as well. It's much faster than loading the full browser. Installing PySide was quite a workout for my CPU, but it's well documented and the code you placed here works nicely. – Doug Bradshaw Dec 10 '14 at 17:17

Use Selenium. https://selenium-python.readthedocs.org

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://example.com')
html_source = browser.page_source  # page source as reported by Chrome
browser.quit()  # close the browser once the source is captured

Then you can parse html_source (the source as Chrome sees it) with lxml.
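
A rough sketch of that parsing step, assuming lxml is installed. The string below is a made-up stand-in for browser.page_source so the snippet runs without a browser; the table and its id are hypothetical. Because the markup already contains the tbody that Chrome adds, an xpath copied from Chrome's developer tools should match it directly.

```python
from lxml import html

# Hypothetical stand-in for browser.page_source: markup already corrected by the browser
html_source = """<html><head></head><body>
<table id="prices"><tbody><tr><td>42</td></tr></tbody></table>
</body></html>"""

tree = html.fromstring(html_source)

# An xpath in the form Chrome reports it, including the tbody step
cells = tree.xpath('//*[@id="prices"]/tbody/tr/td/text()')
print(cells)
```

With a real page, html_source would come from the Selenium snippet above, and the xpath would be the one the user copied from Chrome.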

allcaps
  • This looks like an even bigger dependency... Browser, Java, Selenium, Python-Selenium bindings. And external browser seems much less reliable. – Oleh Prypin Dec 10 '14 at 10:18
  • You are right. Not light, and depending on a real browser is a weak spot. But it does answer the Q. It gives Python the HTML that *Chrome* produces. I know Webkit is the underlying engine. But not every WebKit-based browser behaves exactly the same way. When you need the exact Chrome output, use Chrome to output it. – allcaps Dec 10 '14 at 10:56
  • This answer is correct, and works for my test case. It's also very slow and requires the full popup of the browser. – Doug Bradshaw Dec 10 '14 at 17:15