Parsing HTML table with LXML in Python

Question

I need to parse an HTML table with the following structure:

<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
 <tbody>
   <tr width="620">
     <th width="620">Smth1</th>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth2</td>
     ...
   </tr>
   <tr bgcolor="E4E4E4" width="620">
     <td width="620">Smth3</td>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth4</td>
     ...
   </tr>
 </tbody>
</table>

Python code:

r = requests.post(url,data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
# xpath1 was copied from Firebug
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

But I get this on the third line:

IndexError: list index out of range

The goal is to build a Python dict from this table. The number of rows can vary.

UPD. I changed how I fetch the HTML to rule out problems with the requests library. Now I parse it directly from a URL:

html = lxml.html.parse(test_url)

This shows that everything is OK with the HTML:

lxml.html.open_in_browser(html)

But still the same problem:

rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

Here is the xpath1:

'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'

UPD2. I found experimentally that the XPath matches nothing once it reaches tbody:

xpath1 = '/html/body/table/tbody'
print(html.xpath(xpath1))
# prints []

If xpath1 is shorter, it seems to work: for xpath1 = '/html/body/table' it returns [<Element table at 0x2cbadb0>]

Pro tip: please include the *full* traceback of Python errors to reduce the need to guess for anyone helping you. – Martijn Pieters, Jan 17 '13 at 22:46

score 5 · Accepted Answer · answered Jan 18 '13 at 00:20

You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work:

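import requests
import lxml.html

xpath1 = "tbody/tr"
r = requests.post(url, data)  # url and data as in the question
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
# Collect the text of every cell, one inner list per row
data = list()
for row in rows:
    data.append([c.text for c in row])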

This produces a list of one-item lists, though, like this:

[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]

To get a flat list of the values, you can use this code instead:

xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)

This is all assuming that r.text is exactly what you posted up there.
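
Since the question asks for a dict, here is a minimal sketch of one way to build it, assuming the first row holds the column names in th cells and each later row holds matching td cells. The sample markup is hypothetical: it is modelled on the posted table, and its second column is invented for illustration because the original elides the other cells.

```python
import lxml.html

# Hypothetical sample modelled on the question's table; the second column
# is invented for illustration because the original elides the other cells.
SAMPLE = """
<table class="table1">
 <tbody>
   <tr><th>Name</th><th>Value</th></tr>
   <tr bgcolor="ffffff"><td>a</td><td>1</td></tr>
   <tr bgcolor="E4E4E4"><td>b</td><td>2</td></tr>
 </tbody>
</table>
"""

# For a fragment with a single top-level element, fromstring returns
# that element, here the <table>.
table = lxml.html.fromstring(SAMPLE.strip())

# Direct rows of this table only, whether or not the markup has a <tbody>.
rows = table.xpath("./tbody/tr | ./tr")

# Assumption: the first row is the header row made of <th> cells.
header = [th.text_content().strip() for th in rows[0].xpath("./th")]

# One dict per data row, keyed by the header text.
records = []
for row in rows[1:]:
    values = [td.text_content().strip() for td in row.xpath("./td")]
    records.append(dict(zip(header, values)))

print(records)
```

The number of rows does not matter here, since every row after the header is handled by the same loop. If your table has no header row, you would need to supply the keys yourself.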

Described all changes in UPD, but the problem is still there — Anatoly Maltsev, Jan 18 '13 at 09:11

score 0 · Answer 2 · answered Jan 17 '13 at 22:45

Your .xpath(xpath1) XPath expression failed to find any elements. Check that expression for errors.
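
One way to check the expression is to evaluate successively longer prefixes of it in lxml itself and see where it stops matching. The sketch below does this for a simple slash-separated absolute path; the helper name first_failing_prefix and the sample markup are hypothetical. A frequent reason for a browser-copied path to fail at a tbody step is that browsers add tbody to their DOM even when the raw HTML lacks it, while lxml parses the markup as written. Whether that applies to your page is an assumption you would have to verify.

```python
import lxml.html

# Hypothetical page whose table is written WITHOUT <tbody>, as raw HTML often is.
RAW = "<html><body><table><tr><td>Smth2</td></tr></table></body></html>"
doc = lxml.html.document_fromstring(RAW)


def first_failing_prefix(root, xpath):
    """Return the shortest prefix of an absolute XPath that matches nothing.

    Returns None if the whole path matches. Assumes a plain path whose
    steps are separated by single slashes (no '//' and no '/' in predicates).
    """
    steps = xpath.strip("/").split("/")
    for i in range(1, len(steps) + 1):
        prefix = "/" + "/".join(steps[:i])
        if not root.xpath(prefix):
            return prefix
    return None


# Path as a browser tool might report it, including the tbody step.
browser_xpath = "/html/body/table/tbody/tr/td"
print(first_failing_prefix(doc, browser_xpath))

# The same path without the tbody step matches the cell.
cells = doc.xpath("/html/body/table/tr/td")
print([c.text for c in cells])
```

Testing for an empty result before indexing with [0] also turns the bare IndexError into something easier to diagnose.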

Martijn Pieters

Included XPath1 into description, checked it one more time with FireBug – Anatoly Maltsev Jan 18 '13 at 09:13
run `print html.xpath(xpath1)` to test, not in FireBug. – Martijn Pieters Jan 18 '13 at 09:14
Described the situation in UPD2 – Anatoly Maltsev Jan 18 '13 at 09:42
