
Is there any way to prevent etree from resolving HTML entities when parsing HTML contents?

from lxml import etree
html = etree.HTML('<html><body>&amp;</body></html>')
html.find('.//body').text

This gives me '&' but I want to get '&amp;' itself.

mzjn
Jonghwan Hyeon
    One option/workaround is to process the body text with `cgi.escape`, see http://stackoverflow.com/questions/1061697/whats-the-easiest-way-to-escape-html-in-python. – alecxe Mar 08 '14 at 01:33
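A minimal sketch of the workaround from the comment above. `cgi.escape` was removed in Python 3.8, so this uses the standard library's `html.escape` instead. The variable `parsed_text` is a hypothetical stand-in for what `html.find('.//body').text` returns in the question. Note that this re-escapes the parsed text rather than preserving the original markup, so it cannot tell a literal `&` in the source from `&amp;`.

```python
import html

# Stand-in for the text the parser returned (entities already resolved).
parsed_text = '&'

# Re-escape '&', '<' and '>' so the entity form comes back.
escaped = html.escape(parsed_text, quote=False)
print(escaped)
```

Passing `quote=False` leaves quotation marks alone; by default `html.escape` also converts `"` and `'` to entities.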

1 Answer


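You can always pre- and post-process your data: replace '&' with a placeholder such as u'\xfe' before feeding the markup to the HTML parser, then replace u'\xfe' with '&' in the output. This only works if the placeholder character does not already occur in the input.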

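from lxml import etree
# Swap '&' for a placeholder so the parser finds no entities to resolve.
html = etree.HTML('<html><body>&amp;</body></html>'.replace('&',u'\xfe'))
# Restore '&' in the extracted text.
html.find('.//body').text.replace(u'\xfe','&')
# Result: u'&amp;'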
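The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `body_text_with_entities` and the constant `PLACEHOLDER` are hypothetical, and a private-use character is swapped in for u'\xfe' on the assumption that it is less likely to appear in real input. The helper refuses to run if the placeholder is already present.

```python
from lxml import etree

# Hypothetical placeholder: a private-use code point assumed absent from the input.
PLACEHOLDER = u'\ue000'

def body_text_with_entities(markup):
    """Return the <body> text with entities left unresolved."""
    if PLACEHOLDER in markup:
        raise ValueError('placeholder already occurs in the input')
    # Hide every '&' so the parser has no entities to resolve.
    root = etree.HTML(markup.replace('&', PLACEHOLDER))
    text = root.find('.//body').text or ''
    # Put the ampersands back in the extracted text.
    return text.replace(PLACEHOLDER, '&')

result = body_text_with_entities('<html><body>&amp;</body></html>')
print(result)
```

Like the snippet above, this reads only the text directly inside `<body>`; text after child elements is not included.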
dev007