When parsing HTML with Python and lxml, I would like to retrieve the attribute text of HTML elements exactly as it appears in the source, but instead I get the attribute text with entities resolved. That is, if the attribute in the source reads "this &amp; that", I get back "this & that".
Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:
>>> from lxml import etree
>>> s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(s, parser=parser)
>>> anc = tree.xpath('//a')
>>> a = anc[0]
>>> a.get('alt')
'hi & there'
>>> a.attrib.get('alt')
'hi & there'
>>> etree.tostring(a)
'<a alt="hi &amp; there">a link</a>'
I want to get the actual string "hi &amp; there" from the attribute itself, not just see it in the serialized element.
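For reference, the closest workaround I can sketch is to re-escape the decoded value by hand. This is only an approximation: it uses the standard library's xml.sax.saxutils.escape, the helper name reescape_attribute is hypothetical (it is not part of lxml), and it assumes the source used named entities for &, <, > and ". If the source had written the ampersand as &#38;, the result would still come out as &amp;, so this does not recover the original text.

```python
from xml.sax.saxutils import escape


def reescape_attribute(value):
    """Re-escape a decoded attribute value for use inside double quotes.

    Hypothetical helper, not an lxml API. It rebuilds an escaped form of
    the value; it cannot know how the entity was spelled in the source.
    """
    # escape() handles &, < and > itself; the extra mapping covers the
    # double quote, which matters inside a double-quoted attribute.
    return escape(value, {'"': '&quot;'})


decoded = 'hi & there'  # the value a.get('alt') returns in the session above
print(reescape_attribute(decoded))  # hi &amp; there
```

With the session above, reescape_attribute(a.get('alt')) gives the escaped form, but what I am really after is whether lxml itself can hand back the raw text.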