How to parse HTML table against a list of variables using lxml?

Question

I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches all the values, I want to extract a cell's contents only when the cell starts with one of the variables in my config file. For instance, if a <td> starts with 'Street 1', I want to grab the <span> contents of that <td> tag. That way I get a tuple of tuples (which takes care of the None values) that I can then store in the database.

lxml_parse.py

import lxml.html as lh

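# Parse the file; the with-block closes it even if parsing fails
with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Text of every <span class="boldred"> that is a direct child of a <td>
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)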

test.htm

<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>

Output :

$ python lxml_parse.py 
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against a list of variables is where I am stuck:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()

myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print myresultset

I am having problems with the lxml syntax for: myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars) – ThinkCode May 17 '12 at 19:51
Try `'//tr/td[child::*[text()="' + var + '"]]/span[@class="boldred"]/text()'`? It seems you wanted the content of `var` in the XPath expression and not the literal string `var`. – dav1d May 17 '12 at 20:48

score 1 · Answer 1 · answered May 17 '12 at 20:52

Aiming to produce this dictionary:

{'City:': 'NYC', 
 'Zip:': '10022', 
 'Street 1:': '2100 5th Ave', 
 'Country:': 'USA', 
 'State:': 'NY', 
 'Street 2:': 'Ste 202'}

You can use this code. And then it is easy to query the dictionary to get the values you desire:

import lxml.html as lh

test = '''<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>'''

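outhtml = lh.fromstring(test)
# Labels: non-empty text nodes directly inside each <td> (e.g. 'Street 1:')
ks = [k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip()]
# Values: text of the boldred spans, in document order
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')

# Pair labels with values by position
result = dict(zip(ks, vs))

print(result)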
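
As a minimal sketch of querying that dictionary against a preset list, assuming the result has the shape shown above: the name by_label and the 'Fax' entry are hypothetical, added only to show how a missing label comes back as None. Bear in mind that the zip() pairing relies on every label having exactly one boldred value in the same order; a cell without a value would shift the pairs.

```python
# Dictionary in the shape produced by dict(zip(ks, vs)); keys keep the trailing colon
result = {
    'City:': 'NYC',
    'Zip:': '10022',
    'Street 1:': '2100 5th Ave',
    'Country:': 'USA',
    'State:': 'NY',
    'Street 2:': 'Ste 202',
}

# Labels from the config; 'Fax' is a hypothetical label with no cell in the table
desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip', 'Fax']

# Drop the trailing colon so the keys match the config labels
by_label = {key.rstrip(':').strip(): value for key, value in result.items()}

# Tuple of (label, value) pairs; dict.get() returns None for a missing label
selected = tuple((var, by_label.get(var)) for var in desiredvars)
print(selected)
```

The tuple of tuples can then be passed to the database layer as is, with None standing in for fields the table did not contain.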

answered May 17 '12 at 20:52

daedalus

Thank you. While this works for my sample, I was wondering whether I can look up a preset list of variables, since there are many tables in this HTML and I am parsing it table by table against preset key values. Upvoting! Can you help me with the syntax here: outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')? – ThinkCode May 17 '12 at 20:59
Thanks for the upvote. This should work for similar table structures... Whatever the field name and the values you get in the resulting dictionary can be tested against the keys that you have in your lists of preset variables... Try to upload a couple of other examples? – daedalus May 17 '12 at 21:10
I just tested this on my main html and since the html is pure garbage, I got a whole lot of garbage in my actual test. Pastebin actually hit the max limit for free users when I tried pasting my HTML. Let me try to clean it. – ThinkCode May 17 '12 at 21:28
One could try and use BeautifulSoup as well if the html is malformed. – daedalus May 17 '12 at 21:29
I cleaned the HTML using lxml. The key value pairs are all over the place, the actual paths are messed up. I will try to get individual tables and parse. – ThinkCode May 17 '12 at 21:33
Yes, I understand. Nothing to do if the structure of the raw html changes in the source file. – daedalus May 17 '12 at 21:46

score 0 · Accepted Answer · answered May 21 '12 at 15:34

lxml_tempsofsol.py :

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

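with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# contains() tolerates the whitespace around the label; [0] takes the first match
# and raises IndexError when a label has no matching cell
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)

for each in myresultset:
    print(each)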

Output :

$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')

score 0 · Answer 3 · answered Dec 06 '13 at 08:52

I've searched for the same thing and found your question and no "right" answer so I'll add a couple of points:

To refer to variables in XPath you should use $var syntax,
In lxml variables are passed as keyword arguments to xpath(),
Using child::* is wrong since you search for text directly within <td/>; text() already searches for text child nodes,
You need to use contains() XPath function due to whitespace.

Taking those into account your corrected code looks like this:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# $var is an XPath variable; lxml binds it from the var= keyword argument
myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var)) for var in desiredvars]
print(myresultset)
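
To get the tuple of tuples with None for missing fields that the question asks for, the same idea can be sketched as below. This is only an illustration under a few assumptions: the HTML is a trimmed copy of the sample wrapped in <table> tags, the helper first_or_none and the 'Fax' label are hypothetical, and starts-with(normalize-space(...)) is swapped in for contains() to match "starts with" more literally.

```python
import lxml.html as lh

# Trimmed copy of the sample table, wrapped in <table> so it parses as one element
html = '''<table>
<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
</tr>
</table>'''

# 'Fax' is a hypothetical label that has no cell in the table
desiredvars = ['Street 1', 'Street 2', 'State', 'Fax']

root = lh.fromstring(html)

# $label is bound from the label= keyword argument of xpath();
# normalize-space() trims the whitespace around the cell's first text node
query = ('//tr/td[starts-with(normalize-space(text()), $label)]'
         '/span[@class="boldred"]/text()')


def first_or_none(matches):
    """Return the first XPath match, or None when the list is empty (hypothetical helper)."""
    return matches[0] if matches else None


result = tuple((var, first_or_none(root.xpath(query, label=var)))
               for var in desiredvars)
print(result)
```

Passing the label as an XPath variable also avoids building the expression by string formatting, which would break on a label that contains a double quote.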
