Extract data from HTML in PHP or Python

Question

I need to extract this data and draw a simple graph from it.

Something like Equity Share Capital -> array (30.36, 17.17, .... etc) would help.

<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>

How do I go about this task in PHP or Python?

You should really reproduce part of the file you have posted on _this_ site so this question can be used by others in the future! – Hooked, Dec 19 '10 at 21:00
possible duplicate of [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon, Dec 19 '10 at 21:01
Do you mean I should add some sample HTML to this page? SO seems to display my HTML code as rendered HTML in the browser and not as code. – Nishant, Dec 19 '10 at 21:10
@Nishant, you should put the relevant portion of the HTML file posted above (namely the section about Equity Share Capital). Use the SO _code_ formatting to leave the data untouched. – Hooked, Dec 19 '10 at 21:13

score 5 · Accepted Answer · edited Dec 19 '10 at 22:45

A good place to start looking would be the Python module BeautifulSoup, which parses the HTML into a tree that you can search by tag name.

Assuming you've loaded the data into a variable called raw:

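# BeautifulSoup 3 on Python 2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(raw)

VALS = []  # stays empty if the row is not found
for x in soup.findAll("html:td"):
   # The label must match the cell text exactly, including case.
   if x.string == "Equity Share Capital":
       # Cells in the same row that carry a class attribute hold the numbers.
       VALS = [y.string for y in x.parent.findAll() if y.has_key("class")]

print VALS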

This gives:

[u'30.36', u'17.17', u'15.22', u'9.82', u'9.82']

Note that this is a list of unicode strings; convert them to whatever type you need before processing.
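
As a minimal sketch of that conversion, assuming Python 3 (where the strings no longer show a `u` prefix): the `vals` list below simply copies the output above, and the names `numbers` and `series` are only illustrative.

```python
# Values as they come out of the parser: strings, not numbers.
vals = ['30.36', '17.17', '15.22', '9.82', '9.82']

# Convert each string to a float so it can be plotted or summed.
numbers = [float(v) for v in vals]

# Map the row label to its values, the shape asked for in the question.
series = {"Equity Share Capital": numbers}
print(series)
```

A dictionary keyed by row label makes it easy to add further rows later and hand each list to whatever plotting tool is used.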

There are many ways to do this with BeautifulSoup. The nice thing I've found, however, is that a quick hack is often good enough (TM) to get the job done!
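
For comparison, here is one rough way to do the same extraction without BeautifulSoup, using only `html.parser` from the Python 3 standard library. This is a sketch under assumptions, not the method used above: the class name `RowExtractor` is made up for illustration, `raw` is a copy of the sample row from the question, and the code assumes the first cell of each row is the label and every other cell holds a number.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect each table row as {label: [floats]} (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = {}          # label -> list of numeric values
        self._cells = None      # cell texts of the row being read, or None
        self._in_cell = False   # True while inside an html:td element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; the "html:" prefix is kept.
        if tag == "html:tr":
            self._cells = []
        elif tag == "html:td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between cells is ignored.
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "html:td":
            self._in_cell = False
        elif tag == "html:tr":
            if self._cells:
                # Assumption: first cell is the label, the rest are numbers.
                label, *values = (c.strip() for c in self._cells)
                self.rows[label] = [float(v) for v in values]
            self._cells = None


# Sample row copied from the question.
raw = """
<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>
"""

parser = RowExtractor()
parser.feed(raw)
parser.close()
print(parser.rows)
```

The event-based parser needs more code than the BeautifulSoup loop, but it has no third-party dependency and returns every row in the input, already converted to floats.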

edited Dec 19 '10 at 22:45

jfs

answered Dec 19 '10 at 20:45

Hooked

BeautifulSoup did come up in a Google search; I will look into it further. I would appreciate it if anyone could give a simple solution as well :) – Nishant Dec 19 '10 at 20:47
If you ask for the solution, you won't learn about using it yourself. Sure, someone will give you the solution as it happens, but trying yourself is the best way to learn, in my humble and perhaps worthless opinion :-) – user225312 Dec 19 '10 at 20:49
I have often found problems like this solved in a couple of lines of smart code, but I was not able to find any. The problem looks very trivial, but yeah, learning comes only with experimenting :) – Nishant Dec 19 '10 at 20:51
Thanks a lot, Hooked - that's a really commendable job from you. – Nishant Dec 19 '10 at 21:03

score 2 · Answer 2 · answered Dec 19 '10 at 20:46

BeautifulSoup

answered Dec 19 '10 at 20:46

user225312

score 2 · Answer 3 · answered Dec 19 '10 at 21:03

Don't forget lxml in Python. It also works well for extracting data. It's harder to install than BeautifulSoup, but faster. http://pypi.python.org/pypi/lxml/2.2.8

answered Dec 19 '10 at 21:03

Lennart Regebro

My first task was to tidy up the HTML into good XHTML. lxml did not help much, and the built-in clean function didn't do the job well enough for me. However, lxml's parsing abilities still need to be explored. I will check both options. – Nishant Dec 19 '10 at 21:07
@Nishant: If you only need to parse it you don't need to clean it first, but yeah, lxml's strength is probably not cleaning. – Lennart Regebro Dec 19 '10 at 21:33
