I'm trying to figure out the Python lxml API, but I'm running into a peculiar problem. I've installed the following library versions:

  • libxml2 : 2.7.8
  • libxslt : 1.1.26

When I run the following code:

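from io import BytesIO
from lxml import etree

with open('file.html', 'rb') as f:
    html = f.read()  # read the file contents; the file object itself is not a string
context = etree.iterparse(BytesIO(html), events=("start", "end"), html=True)
for event, element in context:
    pass  # do stuff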

EDIT:

It turns out that it is a parsing error. I moved the HTML to a file (shown below):

<html>
    <head></head>
    <body>
        <table>
            <tr>
                <td>image</td>
                <a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
                <td>   35   </td>
                <td>   28   </td>
                <td><b>-7</b></td>
                <td>   
                23,000    </td>
                <td>   373,000   </td>
                <td>   644,000   </td>
                <td>+72.65%</td>
            </tr>
            <tr>
                <td>image</td>
                <td><a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
                <td>   35   </td>
                <td>   28   </td>
                <td><b>-7</b></td>
                <td>   
                23,000    </td>
                <td>   373,000   </td>
                <td>   644,000   </td>
                <td>+72.65%</td>
            </tr>
        </table>
    </body>
</html>

I'm now getting this error:

for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 12

ORIGINAL ERROR:

for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';', line 7, column 71

I thought I followed the tutorial from lxml's site pretty closely here so I'm very confused. Could it be an installation problem?

Justin Smith

2 Answers

The problem is that the HTML is malformed. To solve this, you can use BeautifulSoup (it's able to parse this HTML) or sanitize the HTML before trying to parse it.

The problems I've found are:

  • The ampersand in the link URLs should be escaped as an HTML entity: & => &amp;
  • The closing td tag after the first a tag has to be removed, since it doesn't match any opening td tag.
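
As a rough illustration of the first fix, here is a minimal sketch that escapes bare ampersands before the markup is handed to a parser. The helper name `escape_bare_ampersands` and the regular expression are my own assumptions, not part of lxml or BeautifulSoup. It is a heuristic: it leaves anything that already looks like an entity untouched, and it does nothing about the stray closing td tag.

```python
import re

# Matches "&" that does not start something shaped like an entity
# (for example &amp;, &#38; or &#x26;).
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def escape_bare_ampersands(html):
    """Replace every bare "&" with "&amp;", leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", html)


link = '<a href="relative.phtml?with=querystring&blah=blah">blah</a>'
fixed = escape_bare_ampersands(link)
print(fixed)
# <a href="relative.phtml?with=querystring&amp;blah=blah">blah</a>
```

Running the helper a second time leaves the string unchanged, so it is safe to apply to input that is already partly escaped.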
jcollado
  • Thanks. I was using BeautifulSoup and then I switched over to lxml because I read that it can deal with large files better, since it supports iterative parsing. I know there's a BeautifulSoup interface with lxml, so maybe I'll try looking there – Justin Smith Dec 31 '11 at 00:19

lxml's iterparse can't parse broken HTML. If you have a really big file or memory limitations, you can write your own parser, like in this answer. But if you can afford to hold the whole tree in memory, you can use lxml.html, which is faster than BeautifulSoup.
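
For the in-memory option, a minimal sketch with lxml.html might look like the following. The sample string is a shortened copy of the malformed row from the question, and the variable names are only illustrative. The exact tree the parser recovers from broken markup can vary between libxml2 versions, so treat the printed values as something to inspect rather than rely on.

```python
import lxml.html

# Shortened version of the malformed row from the question: a bare "&" in
# the href and a stray </td> after the link.
broken = """
<table>
  <tr>
    <td>image</td>
    <a href="relative.phtml?with=querystring&blah=blah">blah</a></td>
    <td>   35   </td>
    <td>+72.65%</td>
  </tr>
</table>
"""

# fromstring() uses a recovering HTML parser, so it builds a tree for this
# input instead of raising XMLSyntaxError.
doc = lxml.html.fromstring(broken)

# Collect the link targets and the stripped text of every table cell.
links = [a.get("href") for a in doc.iter("a")]
cells = [td.text_content().strip() for td in doc.iter("td")]
print(links)
print(cells)
```

The trade-off is the one described above: the whole document is parsed into a tree at once, so this suits files that fit comfortably in memory.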

reclosedev