Using lxml changes &#xA; to &#10; in a line for some reason

Question

When I run the following code, one line in the file is changed:

from lxml import etree

# dpaFile is the path to the XML file, which is rewritten in place
dpa_tree = etree.parse(dpaFile)
dpa_root = dpa_tree.getroot()
dpa_tree.write(dpaFile, encoding='UTF-8', xml_declaration=True, method='xml', standalone=True)

In the original line, the &#xA; towards the end of the line is changed to &#10;. How do I prevent this change?

The original line

<Setting Value="rO0ABXNyAGpjb20udmVjdG9yLmNmZy5nZW4uY29yZS5nZW5jb3JlLmludGVybmFsLmFvdi5BdXRv&#xA;bWF0a....

changes to

<Setting Value="rO0ABXNyAGpjb20udmVjdG9yLmNmZy5nZW4uY29yZS5nZW5jb3JlLmludGVybmFsLmFvdi5BdXRv&#10;bWF0a....

(the ... at the end of the lines is just to indicate I have not posted the entire line.)

`&#xA;` and `&#10;` are [numeric character references](https://en.wikipedia.org/wiki/Numeric_character_reference). Why lxml chooses one over the other I don't know, but they are equivalent. Both represent the line feed character. – mzjn, Mar 18 '22 at 08:10

buddemat · Answer 1 · 2022-03-22T15:59:18.180

Both sequences are equivalent. They are both numeric character references for the Line Feed character. In your original file, the hexadecimal representation (&#xA;) is used, while the lxml output uses the decimal representation (&#10;).

So while there seems to be a difference, both are actually representations of the same character (see Why HTML decimal and HTML hex? for some info on why there are different representations to begin with).
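The equivalence can be checked with any XML parser. The following is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the two one-element documents and the shortened attribute value are made up for illustration and are not taken from the original file.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in how the line feed
# inside the attribute value is written (hexadecimal vs. decimal).
hex_form = '<Setting Value="AB&#xA;CD"/>'
dec_form = '<Setting Value="AB&#10;CD"/>'

# After parsing, both references have been resolved to the same character.
hex_value = ET.fromstring(hex_form).get("Value")
dec_value = ET.fromstring(dec_form).get("Value")

print(repr(hex_value))
print(hex_value == dec_value)
```

A tool that reads the file through an XML parser should therefore see identical attribute values; only a byte-level comparison of the files shows a difference.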

If you need the hexadecimal representation for some reason, you can serialize the element tree to canonical XML with method='c14n' or method='c14n2'. Canonical XML writes a line feed in an attribute value as &#xA;.

dpa_tree.write(dpaFile, method='c14n')

Please note: using the canonical methods is not compatible with adding the options to output an XML declaration (xml_declaration=True) or specifying an encoding (encoding='UTF-8').

However, as the W3C notes:

The XML declaration, including version number and character encoding is omitted from the canonical form. The encoding is not needed since the canonical form is encoded in UTF-8. The version is not needed since the absence of a version number unambiguously indicates XML 1.0.
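If the XML declaration and the original write options have to stay, one possible alternative to canonical XML is to post-process the serialized bytes. This is a rough sketch, not an lxml feature: restore_hex_linefeeds is a hypothetical helper, and the sample bytes stand in for what etree.tostring(dpa_tree, encoding='UTF-8', xml_declaration=True, standalone=True) might return. It assumes every line feed reference in the original file was written as &#xA;, and that the literal text &#10; does not appear inside comments or CDATA sections, where a plain byte replacement would change content.

```python
def restore_hex_linefeeds(serialized: bytes) -> bytes:
    """Rewrite decimal line feed references (&#10;) as hexadecimal (&#xA;).

    Hypothetical helper: a plain byte replacement over the whole document,
    so it does not distinguish attribute values from comments or CDATA.
    """
    return serialized.replace(b"&#10;", b"&#xA;")


# Made-up sample standing in for serializer output (not the original file).
sample = (
    b"<?xml version='1.0' encoding='UTF-8' standalone='yes'?>\n"
    b'<Setting Value="AB&#10;CD"/>'
)

patched = restore_hex_linefeeds(sample)
print(patched.decode("utf-8"))
```

The patched bytes would then be written to the file in binary mode instead of calling dpa_tree.write directly. Tabs and carriage returns in attribute values can be affected in the same way and would need their own replacements.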

edited Mar 22 '22 at 15:59

answered Mar 17 '22 at 21:09

buddemat

Since c14n and c14n2 cannot be combined with encoding='UTF-8' and do not write an XML declaration, is there a way to override tree.write so that it leaves the hexadecimal reference in place instead of converting it to decimal? – Twowolfs Mar 21 '22 at 16:59
If you don't mind me asking: Why is it important to you that the XML be serialized in hexadecimal representation? Does an application somehow depend on it? Also, even though you cannot specify the options, the effect will be the same, see [W3C](https://www.w3.org/TR/xml-c14n11/#NoXMLDecl): *The XML declaration, including version number and character encoding is omitted from the canonical form. The encoding is not needed since the canonical form is encoded in UTF-8. The version is not needed since the absence of a version number unambiguously indicates XML 1.0.* – buddemat Mar 22 '22 at 15:57
We are using some third-party tools and we modify some of their arxml files. The customer supplying the tool reviews these changes and asks us to fix them because they would not guarantee that the difference wouldn't impact their tool. – Twowolfs Mar 22 '22 at 19:33