4

I'm using lxml.html for some HTML parsing in Python. I'd like a rough estimate of where each element would sit on the page once a browser has rendered it. It does not have to be exact, just generally correct. For simplicity, I will ignore the effects of JavaScript on element location. As an end result, I would like to iterate over the elements (e.g., via lxml) and find their x/y coordinates. Any thoughts on how to do this? I don't need to stay with lxml and am happy to try other libraries.

muckabout

2 Answers

5

PyQt with WebKit can render the page and report each element's geometry:

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class MyWebView(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        # Report element positions once the page has finished loading.
        self.loadFinished.connect(self.showelements)

    def showelements(self, ok=True):
        # Root element of the rendered document in the current frame.
        html = self.page().currentFrame().documentElement()
        for link in html.findAll('a'):
            # geometry() is a QRect; getRect() gives (x, y, width, height).
            print(link.toInnerXml(), link.geometry().getRect())


if __name__ == '__main__':
    app = QApplication(sys.argv)

    web = MyWebView()
    web.load(QUrl("http://www.google.com"))
    web.show()

    sys.exit(app.exec_())
Kabie
  • This is fantastic. Is there a way to make this a little more command-line friendly, specifically quitting on its own (or operating on a sequence of URLs)? I have removed 'web.show()' and placed a 'sys.exit(0)' at the end of showelements. – muckabout Dec 05 '10 at 14:03
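One possible way to handle the follow-up in the comment (no window, a sequence of URLs, quitting on its own) is sketched below. It is not from the answer's author. The names SequentialLoader and main are hypothetical, the 1024x768 viewport is an arbitrary assumption made because the view is never shown, and starting the next load from inside the loadFinished handler is assumed to be acceptable for simple pages.

```python
from collections import deque


class SequentialLoader(object):
    """Hypothetical helper: hands URLs to `load` one at a time.

    `load` is any callable that starts loading a URL; `on_done` is called
    once the queue is empty (for example, the application's quit method).
    """

    def __init__(self, urls, load, on_done):
        self.pending = deque(urls)
        self.load = load
        self.on_done = on_done

    def start(self):
        self._next()

    def page_finished(self, ok=True):
        # Move on whether or not the page loaded successfully.
        self._next()

    def _next(self):
        if self.pending:
            self.load(self.pending.popleft())
        else:
            self.on_done()


def main(urls):
    # PyQt4 is imported here so the helper above stays usable without it.
    from PyQt4.QtCore import QSize, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebView

    if not urls:
        return 0

    app = QApplication([])
    view = QWebView()  # never shown, so no window appears
    # Assumption: give the hidden page a layout size of 1024x768.
    view.page().setViewportSize(QSize(1024, 768))

    def report(ok):
        if ok:
            root = view.page().currentFrame().documentElement()
            for link in root.findAll('a'):
                # getRect() returns (x, y, width, height).
                print(view.url().toString(), link.geometry().getRect())
        loader.page_finished(ok)

    loader = SequentialLoader(urls, lambda u: view.load(QUrl(u)), app.quit)
    view.loadFinished.connect(report)
    loader.start()
    return app.exec_()
```

Calling main(['http://www.google.com', 'http://www.python.org']) would print the link rectangles for each page in turn and return once the last page has been handled. A QApplication is still required, so on Linux this needs an X display (real or virtual) even though nothing is shown.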
1

As Sven stated, you need an HTML rendering engine. A question about rendering HTML has been asked before; you could refer to it:

Python library for rendering HTML and javascript

Community
Utku Zihnioglu