how to get unresolved entities from html attributes using python and lxml

Question

When parsing HTML with Python and lxml, I would like to retrieve the actual attribute text of HTML elements, but instead I get the attribute text with entities resolved. That is, if the attribute in the source reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:

from lxml import etree
s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]
a.get('alt')
# 'hi & there'

a.attrib.get('alt')
# 'hi & there'

etree.tostring(a)
# '<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

http://stackoverflow.com/questions/1061697/whats-the-easiest-way-to-escape-html-in-python — Will, May 04 '15 at 19:47
What I would like is a way to get the text unaltered by lxml. cgi.escape escapes by replacing ampersands with entities (for example), but even an unescape function (replacing entities with ampersands) would not help: what I want is the actual text as it exists in the generally unknown HTML source. — Tim, May 04 '15 at 20:15
You'll need to build a custom parser then. Perhaps you can inherit the HTMLParser and override the parsing of the textual bits you want. — Will, May 05 '15 at 08:05

score 2 · Accepted Answer · answered May 05 '15 at 01:15

An unescaped special character such as a bare & is invalid in HTML, and an HTML object model (lxml.etree in this case) only represents valid HTML. So once the source HTML has been loaded into the object model, there is no notion of whether a character was escaped in the source.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For example:

from lxml import etree
s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the source is automatically escaped. output
# (printed as a bytes literal, b'...', on Python 3):
# <div>hi &amp; there</div>

I also believe the HTML tree model doesn't retain any information about the original HTML source; it keeps the repaired, valid version instead. So at this point, all we can see is that every special character comes back escaped when the tree is serialized.

Having said that, how about using cgi.escape() to get escaped entities! :p

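# ..continuing the demo code above:
import cgi  # cgi.escape was removed in Python 3.8; html.escape is the replacement
div = t.xpath('//div')[0]
print(div.text)
# hi & there
print(cgi.escape(div.text))  # escape the element's text, not the element itself
# hi &amp; there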
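
For readers on Python 3, here is a minimal sketch of the same re-escaping idea using html.escape from the standard library. The string `resolved` is only a stand-in for what a.get('alt') returned in the question, and `source_text` is a hypothetical source spelling chosen to show the limitation: re-escaping yields one canonical spelling, which need not match what the source actually contained.

```python
from html import escape, unescape

# Stand-in for the resolved value that a.get('alt') returned in the question.
resolved = 'hi & there'
print(escape(resolved))  # hi &amp; there

# Caveat: re-escaping produces a canonical spelling, not the original one.
# If the source had used a numeric character reference instead
# (hypothetical example), the round trip would not reproduce it.
source_text = 'hi &#38; there'
round_trip = escape(unescape(source_text))
print(round_trip)                 # hi &amp; there
print(round_trip == source_text)  # False
```

So this works when the source is known to use the same entity spelling that html.escape emits, and it cannot recover the source text in general.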

har07 and @Will, thanks--I did not know that the restriction on unescaped chars applied to attributes as well as content. I see what you're both saying and I will rethink my original problem. cgi.escape seems like the only way to answer my question. — Tim, May 05 '15 at 12:57
You can still build your own parser. Just inherit the standard one and overload the methods you need with some cgi.escape voodoo. — Will, May 06 '15 at 17:43
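
The custom-parser idea from the comments could look roughly like the sketch below. Note that it swaps in the standard library's html.parser instead of subclassing lxml's HTMLParser, because html.parser exposes the raw start tag through get_starttag_text(). The class name `RawAttrParser`, its `raw_attrs` list, and the regular expression are illustrative assumptions, and the regex only handles quoted attribute values.

```python
import re
from html.parser import HTMLParser


class RawAttrParser(HTMLParser):
    """Hypothetical helper: record attribute text exactly as written in the source."""

    # name="value" or name='value'; unquoted values are deliberately not handled.
    ATTR_RE = re.compile(r'''([^\s"'<>/=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')

    def __init__(self):
        super().__init__()
        self.raw_attrs = []  # list of (tag, attribute name, raw attribute text)

    def handle_starttag(self, tag, attrs):
        # attrs holds resolved values, so read the unparsed tag text instead.
        raw_tag = self.get_starttag_text()
        for match in self.ATTR_RE.finditer(raw_tag):
            name, double_quoted, single_quoted = match.groups()
            value = double_quoted if double_quoted is not None else single_quoted
            self.raw_attrs.append((tag, name.lower(), value))


raw_parser = RawAttrParser()
raw_parser.feed('<html><body><a alt="hi &amp; there">a link</a></body></html>')
raw_parser.close()
print(raw_parser.raw_attrs)  # [('a', 'alt', 'hi &amp; there')]
```

This reads the attribute text before any entity resolution happens, which is the one place the original spelling is still available. It does not give you an lxml tree, so it would have to run alongside lxml if XPath queries are also needed.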
