I am looking for a way to cleanly convert HTML tables to readable plain text.

For example, given the input:

<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume it is doable with Python alone, with or without extra modules.

I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.

Update

This was an old question for an old use case. Given that pandas now provides the read_html function, my current answer would definitely be pandas-based.

ccpizza

3 Answers

How about using this:

Parse HTML table to Python list?

However, use collections.OrderedDict() instead of a plain dictionary to preserve the column order. Once you have the dictionary, it is very easy to extract and format the text from it:

Using the solution of @Colt 45:

import xml.etree.ElementTree
import collections

s = """\
<table>
    <tr>
        <th>Height</th>
        <th>Width</th>
        <th>Depth</th>
    </tr>
    <tr>
        <td>10</td>
        <td>12</td>
        <td>5</td>
    </tr>
    <tr>
        <td>0</td>
        <td>3</td>
        <td>678</td>
    </tr>
    <tr>
        <td>5</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

table = xml.etree.ElementTree.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    # Python 3: items() replaces iteritems(), and print is a function
    for key, value in collections.OrderedDict(zip(headers, values)).items():
        print(key, value)

Output:

Height 10
Width 12
Depth 5
Height 0
Width 3
Depth 678
Height 5
Width 3
Depth 4
Peter Varo
  • Thank you for the example code, but it only handles one special case. My actual input is a bit more complicated and contains lots of colspans, so it won't display the data the way I want it to. Here is a sample of the actual data: http://pastebin.com/yRQvz2Ww At the moment none of the options I tried (ElementTree, lxml, BeautifulSoup) come close to the output of `w3m -dump` with the input I have. – ccpizza May 25 '13 at 11:53
  • That is a whole different question: the *given input* and *expected output* are not what you originally asked about. For what you asked first, my answer works. – Peter Varo May 25 '13 at 12:07
  • My original example is *generic* and the preferred answer would ideally be *generic* too. The solution you propose does solve the simplest case but is not *generic* enough. – ccpizza May 25 '13 at 17:02
You should look at the standard library modules xml.etree.ElementTree and xml.dom.minidom.
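As a rough illustration of the minidom route, here is a minimal sketch that turns the table from the question into tab-separated lines. The helper names cell_text and table_to_text are made up for this example, and it assumes the markup is well-formed XML (minidom will reject sloppy real-world HTML) and ignores colspan entirely.

```python
from xml.dom import minidom

html_table = """\
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""


def cell_text(node):
    """Concatenate all text beneath a node and collapse runs of whitespace."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(cell_text(child))
    return " ".join("".join(parts).split())


def table_to_text(markup, sep="\t"):
    """Return one line per <tr>, with the cell texts joined by `sep`."""
    doc = minidom.parseString(markup)
    lines = []
    for tr in doc.getElementsByTagName("tr"):
        cells = [
            cell_text(node)
            for node in tr.childNodes
            # Skip the whitespace text nodes that sit between cells.
            if node.nodeType == node.ELEMENT_NODE and node.tagName in ("td", "th")
        ]
        lines.append(sep.join(cells))
    return "\n".join(lines)


print(table_to_text(html_table))
```

Passing sep=" " instead of the default tab gives the exact layout shown in the question.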

Oin

You can use the HTQL module from http://htql.net.

Here is the sample code for your page:

import urllib2  # Python 2 only; on Python 3 use urllib.request instead
import htql

url = 'http://pastebin.com/yRQvz2Ww'
page = urllib2.urlopen(url).read()

query="""<div (ID='super_frame')>1.<div (ID='monster_frame')>1.<div (ID='content_frame')>1.<div (ID='content_left')>1.<div (ID='code_frame2')>1.<div (ID='code_frame')>1.<div (ID='selectable')>1.<div (CLASS='html4strict')>1 &tx
<table>.<tr>{
    c1=<td>:colspan;   t1=<td>1 &tx; 
    c2=<td>2:colspan;   t2=<td>2 &tx;
    c3=<td>3:colspan;   t3=<td>3 &tx; 
    c4=<td>4:colspan;   t4=<td>4 &tx;
    c5=<td>5:colspan;   t5=<td>5 &tx;
}
"""

for t in htql.query(page, query):
    print('\t'.join(t))

The htql.query() call produces 10 columns per row: c1, t1, c2, t2, ..., c5, t5. You can use the colspan values c1..c5 to work out which columns the cell texts t1..t5 belong in.
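One possible way to apply the colspan information is sketched below. The function expand_row is hypothetical and not part of HTQL. It assumes each result row is a flat sequence of alternating colspan/text strings in the order described above, and that a missing colspan or missing cell comes back as an empty string or None, which I have not verified against the HTQL documentation.

```python
def expand_row(fields):
    """Turn [c1, t1, c2, t2, ...] into a list of cells aligned to columns.

    A cell with colspan N is followed by N - 1 empty strings, so that the
    cells after it land in the same columns as in the rendered table.
    """
    cells = []
    # Walk the row as (colspan, text) pairs.
    for span, text in zip(fields[0::2], fields[1::2]):
        span = str(span).strip() if span else ""
        # Treat a missing or non-numeric colspan as a single column.
        width = max(1, int(span)) if span.isdigit() else 1
        cells.append(text or "")
        cells.extend([""] * (width - 1))
    return cells


# Made-up sample row: the first cell spans two columns.
row = ("2", "Name", "", "Value", "", "", "", "", "", "")
print("\t".join(expand_row(row)))
```

With this helper, the loop above could print '\t'.join(expand_row(t)) instead of '\t'.join(t). Note that absent trailing cells show up as empty columns, because an empty cell and a missing one cannot be told apart under these assumptions.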

seagulf