I am trying to use lxml to validate a piece of HTML, but it reports the fragment as invalid even though it should be valid:

from io import StringIO

import lxml.etree

img = """<img src="http://api.com/?data=ey&ip=1&img=1" height="1" width="1">"""
parser = lxml.etree.HTMLParser(recover=False)
lxml.etree.parse(StringIO(img), parser)

raises:

XMLSyntaxError: htmlParseEntityRef: expecting ';', line 1, column 37

Replacing the & that separates the parts of the query string with ; seems to fix the issue, but that should not be required. Using semicolons is a W3C recommendation, not a requirement.

Is there something I can do to get lxml to see this fragment as valid?

TylerH
Alex Rothberg

1 Answer

I can’t test it with lxml, but I guess that you have to escape the ampersands as &amp;:

<img src="http://api.com/?data=ey&amp;ip=1&amp;img=1" height="1" width="1">
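If the tag is built in Python, the escaping can be done by the standard library instead of by hand. The following is only a sketch under a few assumptions: it uses `html.escape` and the stdlib `html.parser` rather than lxml, and `SrcCollector` is a hypothetical helper written for this illustration. It shows that the escaped markup matches the line above and that a parser decodes the attribute back to the original URL, so the URL itself is unchanged. Feeding the resulting string to the strict lxml parser from the question should then avoid the entity error, though that is not verified here.

```python
from html import escape
from html.parser import HTMLParser

url = "http://api.com/?data=ey&ip=1&img=1"

# escape() turns & into &amp; (and, with quote=True, also escapes quotes),
# which is what an HTML attribute value needs.
img = '<img src="{}" height="1" width="1">'.format(escape(url, quote=True))


class SrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with entity references already decoded.
        if tag == "img":
            self.sources.extend(value for name, value in attrs if name == "src")


collector = SrcCollector()
collector.feed(img)
collector.close()

print(img)                          # the escaped markup
print(collector.sources[0] == url)  # the parser sees the original URL
```

The `&amp;` exists only at the markup level; anything that reads the attribute gets the plain `&` back.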
unor
  • I actually think what lxml _requires_ per the W3C _recommendation_ is `;`. – Alex Rothberg Mar 27 '15 at 17:49
  • @AlexRothberg: Why should this be the case? There is no W3C Recommendation that requires `;` in URIs. You are free to build your URIs according to [the URI standard](http://tools.ietf.org/html/std66). In fact, by default HTML GET forms use the `&` for separating name-value pairs in the query component. Your example URI is fine; you just have to escape `&` if used in HTML attributes like `href`. – unor Mar 28 '15 at 01:22
  • The specific recommendation Alex mentions is here: https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2 – Quentin Jul 10 '23 at 13:48