Is there any way to prevent etree from resolving HTML entities when parsing HTML contents?
html = etree.HTML('<html><body>&amp;</body></html>')
html.find('.//body').text
This gives me '&' but I want to get '&amp;' itself.
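You can always pre- and post-process your data: replace '&' with a placeholder such as u'\xfe' before feeding the markup to the HTML parser, then replace u'\xfe' with '&' again on output. This only works if the placeholder character never occurs in the input.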
from lxml import etree
html = etree.HTML('<html><body>&amp;</body></html>'.replace('&', u'\xfe'))
html.find('.//body').text.replace(u'\xfe', '&')
u'&amp;'
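If you need this in more than one place, the same trick can be wrapped in a small helper. This is only a sketch: the function name `body_text_with_entities` and the `PLACEHOLDER` constant are made up for illustration, and it assumes the placeholder character is not part of your real data (it raises an error if it is).

```python
from lxml import etree

# Same stand-in character as above; assumed to be absent from the input.
PLACEHOLDER = u'\xfe'


def body_text_with_entities(markup, placeholder=PLACEHOLDER):
    """Return the <body> text with entity references left unresolved."""
    if placeholder in markup:
        raise ValueError('placeholder already occurs in the input')
    # Hide every '&' so the parser never sees an entity reference.
    root = etree.HTML(markup.replace('&', placeholder))
    text = root.find('.//body').text or ''
    # Put the ampersands back once parsing is done.
    return text.replace(placeholder, '&')


print(body_text_with_entities('<html><body>&amp; &lt;tag&gt;</body></html>'))
```

Note that the replacement hides every '&' in the document, so attribute values and tail text read from the same tree would need the reverse replacement as well.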