
I'm trying to use Python to parse HTML (although, strictly speaking, the server claims it's XHTML), and every parser I've tried (ElementTree, minidom, and lxml) fails. When I look at where the problem is, it's inside a script tag:

<script type="text/javascript">
... // some javascript code
    if (condition1 && condition2) { // croaks on this line

I see what the problem is: the ampersand should be escaped. The trouble is that this is inside a JavaScript script tag, so it can't be escaped, because that would break the code.

What's going on here? How is inline JavaScript able to break my parse, and what can I do about it?

Update: per request, here is the code used with lxml.

>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22

The lxml manual opens Chapter 9 by stating that "lxml provides a very simple and powerful API for parsing XML and HTML", so I wouldn't expect to see that exception.
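
For what it's worth, here's a stripped-down reproduction in the interpreter (the markup is made up, not the actual page, and the output is abridged): the bare `&&` is what trips the parser, and wrapping the script body in a CDATA section, which is apparently how strict XHTML is supposed to handle this, makes it parse cleanly.

>>> from lxml import etree
>>> bad = '<html><script type="text/javascript">if (a && b) {}</script></html>'
>>> etree.fromstring(bad)
Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column ...
>>> good = '<html><script type="text/javascript">//<![CDATA[ if (a && b) {} //]]></script></html>'
>>> etree.fromstring(good)
<Element html at 0x...>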

Michael
  • Could you please add a code sample and tell what happens and what you'd expect to happen? – Schnouki Oct 08 '14 at 22:20
  • Maybe there should be an obligatory "don't parse HTML with an XML parser" rant along the same lines as [this](http://stackoverflow.com/a/1732454/748858). Try an HTML parser like BeautifulSoup... – mgilson Oct 08 '14 at 22:21
  • @mgilson my understanding from reading the lxml manual is that it (supposedly) can parse HTML. – Michael Oct 08 '14 at 22:22
  • @Michael -- That's true. lxml has a [dedicated html parser](http://lxml.de/lxmlhtml.html). You could try that... (And, FWIW, you probably know more about this than I do -- I've never done much HTML parsing :-) – mgilson Oct 08 '14 at 22:24

1 Answer


There are a lot of really crappy ways for HTML parsing to break. Bad HTML is ubiquitous, and both script sections and various templating languages throw monkey wrenches into the works.

But you also seem to be using XML-oriented parsers for the job, which are stricter and thus much, much more likely to break if not given exactly-right, totally valid input, which most HTML (including most XHTML) manifestly is not.

So, use a parser designed to overlook some of the HTML gotchas:

import lxml.html
# the HTML parser tolerates things an XML parser rejects, e.g. a bare '&' inside <script>
d = lxml.html.parse(URL)

That should take you off to the races.
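
If it helps, here's a rough sketch of pulling things back out of the parsed tree; the URL is the one from your traceback, and the queries are just placeholders for whatever you're actually after:

import lxml.html

# placeholder URL and queries: substitute whatever you actually need
doc = lxml.html.parse("http://192.168.1.185/site.html")
root = doc.getroot()

print(root.findtext(".//title"))           # the page's <title>, if it has one
for a in root.xpath(".//a[@href]"):        # every hyperlink in the document
    print(a.get("href"), a.text_content().strip())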

Jonathan Eunice
  • indeed it does, thanks! (although why my other attempt with `lxml.html.etree.parse(URL)` didn't work is beyond me...) – Michael Oct 08 '14 at 22:37
  • `lxml` is an amazing library, but it has some very fiddly bits, and not everything has parallel structure. Which imports you need is, IMO, one of the places where you have to get things *just so*. – Jonathan Eunice Oct 08 '14 at 23:18