I read the answers to "Parse HTML table to Python list?" and tried to use those ideas to read and process local HTML files that I downloaded from a web site
(each file contains one table and starts with the <table class="table"> label). I ran into problems due to the presence of two HTML tags.

With the <thead> label, the parse doesn't pick up the header, and the <tbody> label causes both xml and lxml to fail completely.

I tried googling for a solution, but the answer is most likely buried somewhere in the documentation for xml and/or lxml.

I'm just trying to plug into xml or lxml in the simplest way possible, but would be happy if the community here pointed the way to other 'stable/trusted' modules that might be more appropriate.

I realized I could edit the strings in Python to remove the tags, but that is not very elegant, and I'm trying to learn new things.

Here is the stripped down sample code illustrating the problem:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Parse HTML table to list
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, sys
from xml.etree import ElementTree as ET
from lxml import etree


#                  # this setting blows up

s     = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""

#                  # open this up for clear sailing
if False:
    s     = """<table class="table">

<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>


<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>

</table>
"""

s = s.replace('\n','')
print('0:\n'+s)

# 1: standard-library ElementTree (the input must be well-formed XML)
table = ET.XML(s)
for row in table:
    values = [col.text for col in row]
    print('1:')
    print(values)

# 2: lxml's HTML parser, which wraps the fragment in <html><body>
table = etree.HTML(s).find("body/table")
for row in table:
    values = [col.text for col in row]
    print('2:')
    print(values)

sys.exit()

1 Answer

While waiting for some help showing how to do this in a 'Pythonic way', I came up with an easy brute-force method:

With the string s set to the version that contains the <thead> and <tbody> labels, apply the following code before parsing:

# Strip the wrapper tags so that the <tr> rows become direct children of <table>
for tag in ('<thead>', '</thead>', '<tbody>', '</tbody>'):
    s = s.replace(tag, '')
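
An alternative that leaves the string untouched is to ask the parser for every <tr> element, wherever it sits in the tree. The sketch below assumes the input is well-formed XML, as the sample in the question is; real-world HTML often is not, in which case `ET.XML` raises a `ParseError`. The helper name `table_to_rows` and the shortened `sample` string are made up for illustration. It relies on `Element.iter('tr')`, which walks all descendants rather than only direct children, so rows wrapped in <thead> or <tbody> are still found.

```python
from xml.etree import ElementTree as ET


def table_to_rows(markup):
    """Return one list of cell texts per <tr>, ignoring <thead>/<tbody> wrappers.

    Hypothetical helper for illustration; assumes `markup` is well-formed XML.
    """
    table = ET.XML(markup)
    rows = []
    # iter('tr') visits every descendant <tr>, not just direct children of <table>
    for tr in table.iter('tr'):
        # An empty cell has text None; normalise it to '' and trim whitespace
        rows.append([(cell.text or '').strip() for cell in tr])
    return rows


# Same structure as the sample in the question, written on fewer lines
sample = (
    '<table class="table">'
    '<thead><tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr></thead>'
    '<tbody>'
    '<tr><td>UTG</td><td></td><td>\n</td><td>2.7%, KK+ AQs+ A5s AKo </td></tr>'
    '<tr><td></td><td>BB</td><td>\n</td><td>10.6%, 55+ A9s+ A9o+ </td></tr>'
    '</tbody></table>'
)

for row in table_to_rows(sample):
    print(row)
```

Elements from lxml also provide an `iter()` method, so the same loop should carry over to the tree returned by `etree.HTML(s)`, which may be the better fit when the downloaded files are not strictly well-formed.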