lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';'

Question

I'm trying to figure out the Python lxml API, but I'm running into a peculiar problem. I've installed the following library versions:

libxml2 : 2.7.8
libxslt : 1.1.26

When I run the following code:

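from lxml import etree

# iterparse accepts a filename or a binary file object, so no StringIO wrapper is needed
with open('file.html', 'rb') as html:
    context = etree.iterparse(html, events=("start", "end"), html=True)
    for event, element in context:
        pass  # do stuff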

EDIT:

It turns out that it is a parsing error. I moved the HTML to a file (shown below):

<html>
    <head></head>
    <body>
        <table>
            <tr>
                <td>image</td>
                <a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
                <td>   35   </td>
                <td>   28   </td>
                <td><b>-7</b></td>
                <td>   
                23,000    </td>
                <td>   373,000   </td>
                <td>   644,000   </td>
                <td>+72.65%</td>
            </tr>
            <tr>
                <td>image</td>
                <td><a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
                <td>   35   </td>
                <td>   28   </td>
                <td><b>-7</b></td>
                <td>   
                23,000    </td>
                <td>   373,000   </td>
                <td>   644,000   </td>
                <td>+72.65%</td>
            </tr>
        </table>
    </body>
</html>

I'm now getting this error:

for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 12

ORIGINAL ERROR:

for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next (src/lxml/lxml.etree.c:86484)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';', line 7, column 71

I thought I followed the tutorial on lxml's site pretty closely, so I'm very confused. Could it be an installation problem?

Is that the actual HTML you're parsing? What happens if you build the same HTML with the E builder? – jeffknupp, Dec 29 '11 at 06:06
That's not the actual HTML, I replaced it with '...' for brevity. – Justin Smith, Dec 29 '11 at 06:08
It looks to be failing on a real parse error. If you have trivial HTML, does it succeed? – jeffknupp, Dec 29 '11 at 06:14
You were right. Trivial HTML does go through. I've updated the question accordingly. This doesn't even strike me as poorly constructed HTML, though? – Justin Smith, Dec 29 '11 at 06:41

score 8 · Accepted Answer · answered Dec 29 '11 at 09:08

The problem is that the HTML is malformed. To solve this, you can use BeautifulSoup (it's able to parse this HTML) or sanitize the HTML before trying to parse it.

The problems I've found are:

- The ampersand in the link URLs should be escaped as an HTML entity: & => &amp;
- The closing td tag after the first a tag has to be removed, since it doesn't match any opening td tag.
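
For the first point, a minimal sketch of the "sanitize before parsing" idea, using only the standard library. The helper name escape_bare_ampersands and the regular expression are illustrative assumptions, not something from lxml or BeautifulSoup. It treats any "&" that is not followed by something shaped like an entity as a bare ampersand, so text such as "a&b;" would be left alone by mistake. It also does nothing about the stray closing td tag.

```python
import re

# Matches an "&" that does not already start an entity such as &amp; or &#38;
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")

def escape_bare_ampersands(markup: str) -> str:
    """Replace unescaped ampersands with &amp;, leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", markup)

link = '<a href="relative.phtml?with=querystring&blah=blah">blah</a>'
print(escape_bare_ampersands(link))
```

Running the whole document through a helper like this before handing it to the parser should take care of the "expecting ';'" complaint, but a regex pass is a blunt tool and is only reasonable when you know roughly what the input looks like.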

answered Dec 29 '11 at 09:08

jcollado

Thanks. I was using BeautifulSoup and then switched over to lxml because I read that it can deal with large files better, since it supports iterative parsing. I know there's a BeautifulSoup interface to lxml, so maybe I'll try looking there. – Justin Smith, Dec 31 '11 at 00:19

score 4 · Answer 2 · edited May 23 '17 at 12:00

lxml's iterparse can't parse broken HTML. If you have a really big file or memory limitations, you can write your own parser, like in this answer. But if you can afford to hold the whole tree in memory, you can use lxml.html, which is faster than BeautifulSoup.
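
To give an idea of what "write your own parser" can look like, here is a rough sketch that streams the markup in small chunks and collects the table cell texts. Note that it uses the standard library's html.parser instead of lxml, so it is not the approach from the linked answer, just the same idea with a different tool. The names CellCollector and collect_cells, and the chunk size, are made up for this example.

```python
from html.parser import HTMLParser
from io import StringIO

class CellCollector(HTMLParser):
    """Collect the stripped text of every <td> cell as the markup streams in."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []      # finished cell texts
        self._buffer = None  # text pieces of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # A stray </td> with no open cell is simply ignored.
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None

def collect_cells(stream, chunk_size=64):
    """Feed the stream to the parser chunk by chunk; the whole file is never held at once."""
    parser = CellCollector()
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()
    return parser.cells

sample = (
    "<table><tr><td>image</td>"
    '<a href="relative.phtml?with=querystring&blah=blah">blah</a></td>'
    "<td>   35   </td><td><b>-7</b></td></tr></table>"
)
print(collect_cells(StringIO(sample)))  # ['image', '35', '-7']
```

The sample mirrors the broken first row from the question: the unescaped ampersand and the unmatched closing td do not stop the parser. The link text is skipped because it sits outside any open td, so you still have to decide how your own handler should treat malformed rows like that.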

answered Jan 02 '12 at 07:16

reclosedev
