I'm using lxml.html for some HTML parsing in Python. I'd like to get a rough estimate of where elements would be located on the page after a browser renders it. It does not have to be exact, just generally correct. For simplicity I will ignore the effects of JavaScript on element location. As an end result, I would like to be able to iterate over the elements (e.g., via lxml) and find their x/y coordinates. Any thoughts on how to do this? I don't need to stay with lxml and am happy to try other libraries.
You will need an HTML rendering engine to get this information. A parser won't help. – Sven Marnach Dec 03 '10 at 11:56
You'll also need to consider the effect of CSS. Very little content is rendered without it, these days. – Marcelo Cantos Dec 03 '10 at 12:05
2 Answers
5
PyQt with WebKit:
import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class MyWebView(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        # Run showelements once the page has finished loading.
        QObject.connect(self, SIGNAL('loadFinished(bool)'), self.showelements)

    def showelements(self):
        html = self.page().currentFrame().documentElement()
        # Print each link's inner markup and its rendered rectangle;
        # [18:] strips the leading "PyQt4.QtCore.QRect" from the repr.
        for link in html.findAll('a'):
            print(link.toInnerXml(), str(link.geometry())[18:])

if __name__ == '__main__':
    app = QApplication(sys.argv)
    web = MyWebView()
    web.load(QUrl("http://www.google.com"))
    web.show()
    sys.exit(app.exec_())

Kabie
This is fantastic. Is there a way to make this a little more command-line friendly, specifically quitting on its own (or operating on a sequence of URLs)? I have removed 'web.show()' and placed a 'sys.exit(0)' at the end of showelements. – muckabout Dec 05 '10 at 14:03
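A rough sketch of one way the command-line variant asked about here might look. It is not from the answer's author. It assumes PyQt4 with QtWebKit is installed, drives a bare QWebPage instead of a visible QWebView, and quits once every URL has been processed. The names `rect_to_tuple` and `dump_positions`, the `selector` parameter, and the 1024x768 viewport size are all made up for illustration. It reads the coordinates through the QRect accessors rather than slicing the string representation.

```python
from __future__ import print_function

import sys


def rect_to_tuple(rect):
    """Return (x, y, width, height) from a QRect-like object."""
    return (rect.x(), rect.y(), rect.width(), rect.height())


def dump_positions(urls, selector="a"):
    """Load each URL in turn, print element geometry, then quit.

    Hypothetical helper: 'selector' is a CSS selector passed to findAll.
    """
    pending = list(urls)
    if not pending:
        return 0  # nothing to do; avoid starting an event loop that never ends

    # Imported lazily so rect_to_tuple stays usable without PyQt4.
    from PyQt4.QtCore import QSize, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv[:1])
    page = QWebPage()
    # Without a view the page needs an explicit viewport to lay out against.
    # The size here is an arbitrary assumption.
    page.setViewportSize(QSize(1024, 768))

    def load_next():
        if pending:
            page.mainFrame().load(QUrl(pending.pop(0)))
        else:
            app.quit()

    def on_load_finished(ok):
        if ok:
            root = page.mainFrame().documentElement()
            for element in root.findAll(selector).toList():
                print(rect_to_tuple(element.geometry()), element.toPlainText())
        load_next()  # move on even if this page failed to load

    page.loadFinished.connect(on_load_finished)
    load_next()
    return app.exec_()
```

Calling `sys.exit(dump_positions(sys.argv[1:]))` at the bottom of a script would let it take the URLs as command-line arguments. A QApplication is still created, so a display (or a virtual one) is probably still required.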
1
As stated by Sven, you need an HTML rendering engine. A question on rendering HTML was asked before; you could refer to that.


Utku Zihnioglu