9

I'm doing webpage layout analysis in Python. A fundamental task is to programmatically measure the sizes of elements given the HTML source code, so that we can obtain statistics on content/ad ratio, ad block position, and ad block size for a corpus of webpages.

An obvious approach is to use the width/height attributes, but they're not always available. Besides, values like width: 50% can only be resolved after the page is loaded into the DOM. So I guess loading the HTML source into a browser with a predefined window size (like mechanize, although I'm not sure whether its window size can be set) is worth trying, but mechanize doesn't support returning an element's size anyway.

Is there any universal way (without relying on width/height attributes) to do this in Python, preferably with some library?

Thanks!

shuaiyuancn
  • Man, I can't even get my elements to render to the same size in IE and Firefox. If there is an "official" way to calculate dimensions, you can bet that half the market ignores that and does it their own way. – Kevin Mar 27 '13 at 16:33
  • Just to point you in a direction -- might wanna look into what WebKit and the other renderers offer as output. Obviously won't get Trident, but WK / Gecko might be good enough... – TC1 Mar 27 '13 at 16:57
  • @Kevin Your concern is certainly valid. But for (empirical) research purposes, I'll stick to any browser that can do this. I understand that IE and Firefox don't render some elements at the same size, and I've suffered from that, too. But is the difference really that large? I'm not worried about a drift of a few pixels here :) – shuaiyuancn Mar 27 '13 at 16:57

2 Answers

3

I suggest you take a look at Ghost, a WebKit web client written in Python. It has JavaScript support, so you can easily call JavaScript functions and get their return values. This example shows how to find the width of the Google search box:

>>> from ghost import Ghost
>>> ghost = Ghost()
>>> ghost.open('https://google.lt')
>>> width, resources = ghost.evaluate("document.getElementById('gbqfq').offsetWidth;")
>>> width
541.0  # google text box width 541px
Craig Anderson
Zygimantas Gatelis
0

To properly get all the final sizes, you need to render the contents, taking into account all CSS style sheets and possibly all JavaScript. Therefore, the only ways to get the sizes from a Python program are to have a full web browser implementation in Python, to use a library that provides one, or to drive a browser remotely in a separate process.

The latter approach can be done with the Selenium tools. To see how you can get the result of JavaScript expressions from within a Python program, check: Can Selenium web driver have access to javascript global variables?
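
As a rough illustration of the Selenium route, here is a minimal sketch, not a definitive implementation. It assumes Selenium 4 plus Firefox and geckodriver are installed. The helper names (measure_elements, area_share, measure_html), the 1280x800 window size, and the ".ad" selector in the usage note are hypothetical choices for this example only. Note that the window size is not exactly the viewport size, so percentage widths may resolve against a slightly smaller width.

```python
from urllib.parse import quote


def measure_elements(driver, css_selector):
    """Return the rendered box of every element matching css_selector.

    `driver` is assumed to behave like a Selenium WebDriver that has
    already loaded a page.
    """
    boxes = []
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    for element in driver.find_elements("css selector", css_selector):
        rect = element.rect  # dict with x, y, width, height
        boxes.append({
            "tag": element.tag_name,
            "x": rect["x"],
            "y": rect["y"],
            "width": rect["width"],
            "height": rect["height"],
        })
    return boxes


def area_share(boxes, page_width, page_height):
    """Fraction of the page area covered by the boxes.

    Overlapping boxes are counted more than once (hypothetical helper).
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    return sum(b["width"] * b["height"] for b in boxes) / page_area


def measure_html(html, css_selector, window_size=(1280, 800)):
    """Render an HTML string in headless Firefox and measure matching elements."""
    # Imported lazily so the helpers above work without selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.set_window_size(*window_size)
        # A data: URL lets the browser render the source without a web server.
        driver.get("data:text/html;charset=utf-8," + quote(html))
        return measure_elements(driver, css_selector)
    finally:
        driver.quit()
```

With a browser available, something like measure_html(source, ".ad") would return one box per matching element, and area_share(boxes, page_width, page_height) would turn those boxes into a crude ad/page ratio. Since rendering differs between engines, the numbers are only comparable within one browser and one window size.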

Community
jsbueno