I'm trying to use Python to parse HTML (although strictly speaking, the server claims it's xhtml) and every parser I have tried (ElementTree, minidom, and lxml) all fail. When I go to look at where the problem is, it's inside a script tag:
<script type="text/javascript">
... // some javascript code
if (condition1 && condition2) { // croaks on this line
I see what the problem is, the ampersand should be quoted. The problem is, this is inside a javascript script tag, so it cannot be quoted, because that would break the code.
What's going on here? How is inline javascript able to break my parse, and what can I do about it?
Update: per request, here is the code used with lxml.
>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22
The lxml manual starts Chapter 9 by stating "lxml provides a very simple and powerful API for parsing XML and HTML" so I would expect to not see that exception.