
When I use lxml to parse an HTML document and then serialize it with etree.tostring(), the ampersands in links are converted to HTML-escaped entities.

This is breaking the link, for obvious reasons. Here is a simple self-contained example of the problem:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring("""<a href="https://www.example.com/?param1=value1&param2=value2">link</a>""", parser)
>>> etree.tostring(tree)
'<html><body><a href="https://www.example.com/?param1=value1&amp;param2=value2">link</a></body></html>'

I would like the output to be:

<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>
user3942918
Kevin Dolan

2 Answers


Escaping & as &amp; is the standard behaviour. If you really need to avoid the conversion for some reason, you can do the following:

Step 1. Find a unique string that does not appear in your HTML source. You can simply use ANDamp; as your reserved_amp variable if you are confident that the string "ANDamp;" will not appear in your HTML source. Otherwise, consider generating a random alphanumeric string and checking that it does not occur in your HTML source:

>>> import random
>>> import string
>>> length = 15  # increase the length if collisions still occur
>>> reserved_amp = "&amp;"  # initial value only forces the loop to run at least once
>>> html = """<a href="https://www.example.com/?param1=value1&param2=value2">link</a>"""
>>> while reserved_amp == "&amp;" or reserved_amp in html:  # retry until the token is absent from html
...     reserved_amp = ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(length)) + "amp;"  # the "amp;" suffix makes the token easy to spot
... 
>>> print(reserved_amp)
2eya6oywxg5z7q5amp;

Step 2. Replace all occurrences of & before parsing:

>>> html = html.replace("&", reserved_amp)
>>> html
'<a href="https://www.example.com/?param1=value12eya6oywxg5z7q5amp;param2=value2">link</a>'
>>> 

Step 3. Replace it back, only if you need the original form:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> etree.tostring(tree, encoding="unicode").replace(reserved_amp, "&")  # encoding="unicode" returns str, so str.replace also works on Python 3
'<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>'
>>> 
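
The three steps above can be collected into a few small helpers. This is only a sketch: the names make_placeholder, protect and restore are made up for illustration and are not part of lxml, and the snippet deliberately leaves out the parse/serialize step so that it runs without any third-party package.

```python
import random
import string

# Alphabet for the random part; it deliberately contains no ';'.
ALPHABET = string.ascii_lowercase + string.digits


def make_placeholder(html, length=15):
    """Return a random token ending in 'amp;' that does not occur in html."""
    while True:
        token = "".join(random.choice(ALPHABET) for _ in range(length)) + "amp;"
        if token not in html:
            return token


def protect(html, token):
    """Step 2: hide every '&' behind the placeholder before parsing."""
    return html.replace("&", token)


def restore(serialized, token):
    """Step 3: turn every placeholder back into a bare '&' after serializing."""
    return serialized.replace(token, "&")


source = '<a href="https://www.example.com/?param1=value1&param2=value2">link</a>'
token = make_placeholder(source)
protected = protect(source, token)
round_tripped = restore(protected, token)
print(round_tripped == source)
```

With lxml, you would parse protected and call restore() on the string returned by etree.tostring(tree, encoding="unicode"). Keep in mind that this restores every & in the document, including those in text content, so the result may no longer be well-formed markup.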

[UPDATE]:

The semicolon at the end of reserved_amp is a safeguard.

What if we generated a reserved_amp like this?

ampXampXampXampX + amp;

And html contains:

yyYampX&

It will be encoded in this form:

yyYampXampXampXampXampXamp;

Even so, decoding cannot produce a wrong result with the & in the wrong place instead of the original yyYampX&. The trailing semicolon guarantees this: it is not an alphanumeric character, so it can never appear in the random part generated from string.ascii_lowercase + string.digits above, and a match can only end where a real placeholder ends.

So, as long as the random part never contains a semicolon (or any other non-alphanumeric character you choose as the terminator) and that character is appended at the very end (it MUST be the last character), there is no need to worry about yyYampX& being restored with the & in the wrong position.
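
The effect of the safeguard can be checked with plain string operations. The sketch below reuses the worst-case values from the example above and compares a placeholder that ends in ";" with a hypothetical variant that lacks it; the variable names are illustrative only.

```python
body = "ampXampXampXampX"        # worst-case random part from the example above
text = "yyYampX&"                # source text that ends in a bare '&'

with_guard = body + "amp;"       # placeholder as built in step 1
without_guard = body + "amp"     # hypothetical placeholder with no trailing ';'

# Encode: hide the '&' behind each placeholder.
encoded_guard = text.replace("&", with_guard)
encoded_plain = text.replace("&", without_guard)

# Decode: replace the placeholder with '&' again.
decoded_guard = encoded_guard.replace(with_guard, "&")
decoded_plain = encoded_plain.replace(without_guard, "&")

print(decoded_guard == text)     # guarded placeholder round-trips
print(decoded_plain == text)     # unguarded placeholder matches too early
```

Without the semicolon, the leftmost match starts inside the original text, so the & ends up in the wrong place; with it, the only possible match is the placeholder that was actually inserted.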

Community
林果皞

According to lxml's tostring() docs, method='xml' can be passed to avoid HTML-specific serialization:

etree.tostring(tree, method='xml')

In my projects I use:

from lxml import html
html.tostring(node, with_tail=False, method='xml', encoding='unicode')
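
To see what a serializer does with & in an attribute, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml. That substitution is an assumption made so the snippet needs no third-party package, and lxml's output may differ in detail. It serializes the same link with both methods and then reads the attribute back.

```python
import xml.etree.ElementTree as ET

url = "https://www.example.com/?param1=value1&param2=value2"

# Build the link directly so the attribute value holds a bare '&'.
link = ET.Element("a", {"href": url})
link.text = "link"

# Serialize with both output methods.
as_xml = ET.tostring(link, encoding="unicode", method="xml")
as_html = ET.tostring(link, encoding="unicode", method="html")

# Parse the XML output again and read the attribute value back.
parsed_href = ET.fromstring(as_xml).get("href")

print(as_xml)
print(as_html)
print(parsed_href)
```

In this stand-in, both methods write the ampersand as &amp; in the markup, while the value read back through the API is the original URL. That suggests the escaping is part of the serialized form only, which is worth checking against your own lxml version before relying on method='xml'.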
MaxCore
    This method didn't work for me. I needed to pretty-print an XML tree, but some elements had ` ` in their text. Using `stree.tostring(element, encoding='unicode', pretty_print=True, method='xml')` I got `&amp#160;` – Leonardo May 07 '20 at 15:04