According to (for instance) this discussion, newlines in strings in XML files are allowed. I have such a file and I need to manipulate it. Specifically, for certain subelements with an existing id, I need to create duplicates with a new id. I have a test Python script that does this:

import copy
import xml.etree.ElementTree as ET


def double_override_entries(input_file, search_id, replace_id, output_file):
    tree = ET.parse(input_file)
    root = tree.getroot()

    for override_elem in root.findall(".//override"):
        # findall() returns a list, so appending inside the loop is safe.
        for language_elem in override_elem.findall(f".//language[@id='{search_id}']"):
            # Duplicate the matching element and give the copy the new id.
            language_elem_copy = copy.deepcopy(language_elem)
            language_elem_copy.set('id', replace_id)
            override_elem.append(language_elem_copy)

    # Serialize the modified tree and write it out as UTF-8.
    xml_str = ET.tostring(root, encoding='utf-8').decode('utf-8')

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(xml_str)

input_file = "input.xml"
search_id = "yy"
replace_id = "zz"
output_file = "output.xml"

double_override_entries(input_file, search_id, replace_id, output_file)

This works, except for one thing: the newlines inside attribute value strings in the original XML have been replaced by spaces. I'm looking for a way to do the same manipulation while preserving these newlines.

Input:

<?xml version="1.0" encoding="UTF-8"?>
<model>
  <overrides>
    <override identifier="id-12391" labelPlacement="manual" labelOffset="(35,20)">
      <language id="en" labelText="Internal
        Exploitation
        Dependencies"/>
      <language id="yy" labelText="Internal
        Exploitation
        Dependencies"/>
    </override>
  </overrides>
</model>

Output:

<model>
  <overrides>
    <override identifier="id-12391" labelPlacement="manual" labelOffset="(35,20)">
      <language id="en" labelText="Internal         Exploitation         Dependencies" />
      <language id="yy" labelText="Internal         Exploitation         Dependencies" />
    <language id="zz" labelText="Internal         Exploitation         Dependencies" />
    </override>
  </overrides>
</model>

[Update] ElementTree is compliant (see answer below), so I removed that statement from the question. I would still like a Python workaround for this non-compliant use, though. [/Update]

Any trick to do this would be fine, though I need to run it on macOS + MacPorts (currently using Python 3.11).

gctwnl

1 Answer

I found this SO question and answer which postulates that newlines in attributes should be passed from input to output. The most upvoted answer refers to this part of the XML spec:

5.2 Character Escaping
The XML 1.0 specification requires XML processors to perform certain simple transformations on white-space characters in XML documents, when they serve as line separators and when they appear in attribute values. For each character in the result of the transformation, there will be a character information item as described by the Information Set. [...]

But this only partly applies to this situation. Digging deeper into the topic, you will find the section on attribute-value normalization, which states:

3.3.3 Attribute-Value Normalization

Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

And this algorithm is:

  1. All line breaks MUST have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
  2. Begin with a normalized value consisting of the empty string.
  3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
    • For a character reference, append the referenced character to the normalized value.
    • For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
    • For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
    • For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
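
The difference between a literal line break and a character reference can be checked directly in Python. This is only a small sketch using ElementTree on two made-up one-element documents (the shortened labelText values are illustrative, not taken from the question's file):

```python
import xml.etree.ElementTree as ET

# Literal newline inside the attribute value: the parser normalizes it to a space.
literal = ET.fromstring('<language id="en" labelText="Internal\nExploitation"/>')

# Character reference &#10;: the referenced newline survives normalization.
reference = ET.fromstring('<language id="en" labelText="Internal&#10;Exploitation"/>')

print(repr(literal.get("labelText")))
print(repr(reference.get("labelText")))
```

The first value comes back with a space where the line break was, the second with a real newline character, which is what the normalization rules quoted above prescribe.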


I tested your input with xsltproc and the identity template, and the output matches yours. So ElementTree's behavior appears to conform to the standard.

Applying the algorithm from the spec above gives exactly this output: the third bullet of step 3 replaces every literal #xA in an attribute value with #x20.


So the answer to your question is apparently: No!
Solution: put the values containing newlines in elements and not attributes.
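
As a rough illustration of that suggestion, here is a sketch with a hypothetical layout in which the label is stored as the text of a child element (the child element name labelText is an assumption, not something your application necessarily accepts). Element text is not subject to attribute-value normalization, so the line breaks survive a parse and serialize round trip:

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the label is element text instead of a labelText attribute.
source = (
    '<language id="yy"><labelText>Internal\n'
    'Exploitation\n'
    'Dependencies</labelText></language>'
)

elem = ET.fromstring(source)
label = elem.find("labelText").text

# Serializing again keeps the newlines in the element text.
round_tripped = ET.tostring(elem, encoding="unicode")
print(repr(label))
print(round_tripped)
```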

zx485
  • Correct answer. Mine would have been shorter: XML mandates that newlines in attribute values are converted to spaces. Yes, it's a pain. – Michael Kay Jun 02 '23 at 10:44
  • There is a partial workaround option and it's quoted in your answer. It's to use character references. See https://stackoverflow.com/a/63142223/317052 for an example on why it's a partial workaround. – Daniel Haley Jun 02 '23 at 14:27
  • Thank you. I'll have to live with this behaviour, which indeed is a pain because I need people to edit XML files and 'infinite' strings with encoded newlines are not a real option there. As long as these work in my application, I'll keep as is. When that ends, I'll have to add a transformation from 'bad' to 'good' XML. I still would use a workaround to do this 'non-conformant' XML stuff in a python script, though. – gctwnl Jun 03 '23 at 12:06
  • Well, if you restructured the XML layout, you could move the values containing newlines from attributes to elements as suggested in the last line of my answer. – zx485 Jun 03 '23 at 12:39
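
Following up on the character-reference idea from the comments, here is a minimal sketch of a possible Python workaround, not a robust solution. It rewrites literal newlines inside attribute values to &#10; before parsing, and turns them back into literal newlines after serializing. The helper names protect_newlines and restore_newlines are made up for this example. The regular expressions assume there is no tag-like text inside comments or CDATA sections, and restore_newlines will also rewrite any &#10; that was a genuine character reference in the input. The final output is again the "non-conformant" style, so other XML tools will normalize it just as before.

```python
import re
import xml.etree.ElementTree as ET

# One start tag or empty-element tag; quoted attribute values may span lines.
TAG_RE = re.compile(r'<[A-Za-z_][^<>"\']*(?:(?:"[^"]*"|\'[^\']*\')[^<>"\']*)*>')
# A single quoted attribute value inside such a tag.
VALUE_RE = re.compile(r'"[^"]*"|\'[^\']*\'')


def protect_newlines(xml_text):
    """Replace literal newlines inside attribute values with &#10; references."""
    def fix_tag(match):
        return VALUE_RE.sub(
            lambda v: v.group(0).replace("\n", "&#10;"), match.group(0)
        )
    return TAG_RE.sub(fix_tag, xml_text)


def restore_newlines(xml_text):
    """Turn &#10; references in serialized XML back into literal newlines."""
    return xml_text.replace("&#10;", "\n")


source = '''<model>
  <override identifier="id-12391">
    <language id="yy" labelText="Internal
        Exploitation
        Dependencies"/>
  </override>
</model>'''

# Parse the protected text: the attribute value now keeps its newlines.
root = ET.fromstring(protect_newlines(source))
label = root.find(".//language").get("labelText")

# Serialize, then restore literal newlines in the written text.
output = restore_newlines(ET.tostring(root, encoding="unicode"))
print(output)
```

In the script from the question, this would mean reading the input file as text, passing it through protect_newlines, parsing with ET.fromstring instead of ET.parse, and applying restore_newlines to xml_str before writing it.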