LXML's etree.tostring escaping urls in link href attributes

Question

When I use lxml to parse an HTML document and then call etree.tostring(), I notice that the ampersands in links are converted to HTML-escaped entities.

This is breaking the link, for obvious reasons. Here is a simple self-contained example of the problem:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring("""<a href="https://www.example.com/?param1=value1&param2=value2">link</a>""", parser)
>>> etree.tostring(tree)
'<html><body><a href="https://www.example.com/?param1=value1&amp;param2=value2">link</a></body></html>'

I would like the output to be:

<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>

score 2 · Answer 1 · edited May 23 '17 at 11:48

Encoding & as &amp; is the standard way to serialize it. If you really need to avoid the conversion for some reason, you can do the following:

Step 1. Find a unique string that does not occur in your HTML source. You can simply use ANDamp; as your reserved_amp value if you are confident that the string "ANDamp;" will not appear in your HTML source. Otherwise, generate a random alphanumeric string and check that it does not occur in your HTML source:

>>> import random
>>> import string
>>> length = 15  # increase the length if collisions still occur
>>> reserved_amp = "&amp;"
>>> html = """<a href="https://www.example.com/?param1=value1&param2=value2">link</a>"""
>>> # regenerate until the token differs from "&amp;" and does not occur anywhere in html
>>> while reserved_amp == "&amp;" or reserved_amp in html:
...     reserved_amp = ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(length)) + "amp;"  # the "amp;" suffix makes the token easy to spot
... 
>>> print(reserved_amp)
2eya6oywxg5z7q5amp;

Step 2. Replace all occurrences of & before parsing:

>>> html = html.replace("&", reserved_amp)
>>> html
'<a href="https://www.example.com/?param1=value12eya6oywxg5z7q5amp;param2=value2">link</a>'

Step 3. Replace it back only if you need the original form:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> # encoding='unicode' returns text rather than bytes, so the str replace also works on Python 3
>>> etree.tostring(tree, encoding='unicode').replace(reserved_amp, "&")
'<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>'
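The three steps can be collected into a few small functions. This is only a sketch: the names make_placeholder, protect, restore and serialize_with_raw_ampersands are hypothetical helpers made up for illustration, not part of lxml, and the lxml import is done lazily so that the pure string helpers can be tried on their own.

```python
import random
import string


def make_placeholder(source, length=15):
    """Step 1: return a random token ending in 'amp;' that does not occur in source."""
    alphabet = string.ascii_lowercase + string.digits
    while True:
        token = "".join(random.choice(alphabet) for _ in range(length)) + "amp;"
        if token not in source:
            return token


def protect(source, placeholder):
    """Step 2: hide every '&' behind the placeholder before parsing."""
    return source.replace("&", placeholder)


def restore(serialized, placeholder):
    """Step 3: turn the placeholder back into a literal '&'."""
    return serialized.replace(placeholder, "&")


def serialize_with_raw_ampersands(source):
    """Parse with lxml's HTMLParser and serialize without '&amp;' (hypothetical helper)."""
    from lxml import etree  # lazy import: the helpers above do not need lxml

    placeholder = make_placeholder(source)
    tree = etree.fromstring(protect(source, placeholder), etree.HTMLParser())
    # encoding="unicode" makes tostring() return str instead of bytes
    return restore(etree.tostring(tree, encoding="unicode"), placeholder)
```

Keep in mind that the result is no longer well-formed markup wherever a bare & appears, so this is only suitable when the output is consumed by something that expects the raw form.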

[UPDATE]:

The semicolon at the end of reserved_amp is a safeguard.

What if we generated a reserved_amp like this?

ampXampXampXampX + amp;

And the HTML contains:

yyYampX&

It will be encoded in this form:

yyYampXampXampXampXampXamp;

Even so, it cannot be decoded to a wrong result such as yyY&ampX (the original is yyYampX&), because the semicolon guard in the last position is a non-alphanumeric character, which is never generated from string.ascii_lowercase + string.digits in the random part of reserved_amp above.

So, as long as the random part never contains a semicolon (or whichever non-alphanumeric character you choose) and that character is appended at the very end (it MUST be the last character), there is no risk of yyYampX& being reverted to yyY&ampX.
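The effect of the trailing guard character can be checked with plain string operations. The sketch below reuses the example strings from this update; the encode and decode helpers are made-up names that simply wrap the two replace calls from Steps 2 and 3.

```python
def encode(text, placeholder):
    # Step 2: swap every '&' for the placeholder
    return text.replace("&", placeholder)


def decode(text, placeholder):
    # Step 3: swap the placeholder back to '&'
    return text.replace(placeholder, "&")


original = "yyYampX&"

guarded = "ampXampXampXampX" + "amp;"  # random part plus the ';' guard
unguarded = "ampXampXampXampX"         # same random part, no guard

guarded_roundtrip = decode(encode(original, guarded), guarded)
unguarded_roundtrip = decode(encode(original, unguarded), unguarded)

print(guarded_roundtrip == original)    # True
print(unguarded_roundtrip == original)  # False: the match starts four characters too early
```

Without the guard, the placeholder matches across the boundary between the original text and the inserted token, so the & ends up in the wrong position.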

Patching lxml is better than this hack – Taha Jahangir Feb 13 '15 at 17:14

score 0 · Answer 2 · answered Jun 19 '19 at 23:24

According to the docs for lxml's tostring(), you can pass method='xml' to avoid HTML-specific serialization:

etree.tostring(tree, method='xml')

In my projects I use:

from lxml import html
html.tostring(node, with_tail=False, method='xml', encoding='unicode')

MaxCore


This method didn't work for me. I needed to pretty-print an XML tree, but some elements had `&#160;` in their text. Using `stree.tostring(element, encoding='unicode', pretty_print=True, method='xml')` I got `&amp#160;` – Leonardo May 07 '20 at 15:04
