Replacing newlines in XML attributes with XSLT

Question

I need some XSLT (or something - see below) to replace newlines in all attributes with an alternative character.

I am having to process legacy XML which stores all data as attributes, and uses new-lines to express cardinality. For example:

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

These newlines are replaced with spaces when I parse the file in Java (as per the XML spec), but I want to treat the values as a list, so this behaviour isn't particularly useful.

My 'solution' was to use XSLT to replace all newlines in all attributes with some other delimiter - but I have zero knowledge of XSLT. All examples I've seen thus far have either been very specific or have replaced node content instead of attribute values.

I have dabbled with XSLT 2.0's replace() but am having a hard time putting everything together.

Is XSLT even the correct solution? The XSLT below:

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

applied to the sample XML outputs the following using Saxon:

John Paul Ringo

Obviously this format isn't what I'm after - this is just to experiment with replace() - but have the newlines already been normalised by the time we get to XSLT processing? If so, are there any other ways to parse these values as writ using a Java parser? I've only used JAXB thus far.

I have a very nasty feeling that I may need to don my rubber gloves and implement a filthy regex on the XML string prior to parsing. Unfortunately I have no control over the XML being produced. — nullPainter, Jul 02 '13 at 07:29
If the whitespace within attribute values is semantically significant then you're not dealing with XML, and you'll need to use a non-XML tool to handle it. [Per spec](http://www.w3.org/TR/xml/#AVNormalize) all newlines within an attribute value _must_ be converted to spaces by the parser, and if you want a newline character in the value that you see after parsing then it must be escaped as a character reference (`&#10;`) — Ian Roberts, Jul 02 '13 at 08:29
I don't disagree with you. The XML is exported from an application which will remain nameless. It's not _entirely_ the application's fault, although stuffing all data in attributes is arguably a somewhat dubious approach. I suspect the users have worked around a lack of 1:M cardinality for this particular field by using newlines which the application blindly exported unadulterated to XML. — nullPainter, Jul 02 '13 at 09:35
I might do some research into any Java libraries which are designed for dubious XML - this can't be an isolated instance so I'm sure somebody out there has written a deliberately loose / forgiving parser. — nullPainter, Jul 02 '13 at 09:37
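The normalisation rule discussed in these comments can be checked with a minimal sketch, assuming the JDK's built-in DOM parser and no DTD. The class and method names (`AttributeNormalizationDemo`, `attValue`) are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeNormalizationDemo {

    /** Parses the XML string and returns the root element's "att" attribute value. */
    public static String attValue(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Literal newlines inside the attribute value: the parser turns each into a space.
        String literal = attValue("<p att=\"John\nPaul\nRingo\"/>");
        // Newlines written as character references: these survive parsing.
        String escaped = attValue("<p att=\"John&#10;Paul&#10;Ringo\"/>");

        // Make newlines visible when printing.
        System.out.println(literal.replace("\n", "\\n")); // John Paul Ringo
        System.out.println(escaped.replace("\n", "\\n")); // John\nPaul\nRingo
    }
}
```

Once the literal newlines have become spaces, nothing downstream of the parser (XSLT, JAXB, DOM) can tell them apart from ordinary spaces in the value.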

score 2 · Answer 1 · edited May 23 '17 at 12:25

It seems this is hard to do. As I found in "Are line breaks in XML attribute values allowed?" (https://stackoverflow.com/a/8188290/1324394), a newline character in an attribute value is valid, but the XML parser normalizes it, so it is probably lost before XSLT processing (and thus before any replacement).

answered Jul 02 '13 at 07:22

Jirka Š.

I saw that too, but I was hoping that they'd still be there for some XSLT fix-ups. I have since found http://jdom.org/ which skirts around the problem by not claiming to be an XML parser, which presumably relieves it of having to comply with the XML spec. Going to give it a shot now... – nullPainter Jul 02 '13 at 08:06
Just thinking aloud, you could do something like `replace(/data/@value, '\s{2,10}','|')`. It is not strictly correct, because it relies on each newline being followed by more than one space, but it could do the job. – Jirka Š. Jul 02 '13 at 08:10
@JirkaŠ. no, that wouldn't work, because the XML parser collapses all consecutive whitespace in attribute values to a single space before the data gets as far as the XPath data model. – Ian Roberts Jul 02 '13 at 08:12
I was afraid of that, but I tried it in Altova and it worked. It might just be Altova-specific behaviour. – Jirka Š. Jul 02 '13 at 08:17
Ah, I see I missed the crucial sentence in the [spec](http://www.w3.org/TR/xml/#AVNormalize): "All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA." - so if you don't have a DTD the parser will replace newlines with spaces but _won't_ collapse consecutive spaces to a single space. – Ian Roberts Jul 02 '13 at 08:27
Re: JDom investigation - despite an explicit indication to the contrary at http://stackoverflow.com/a/10439549/1239406, attributes are normalised by Xerces even before getting to JDom. – nullPainter Jul 02 '13 at 09:03
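The point made in the comments above (no DTD means newlines become spaces, but runs of spaces are not collapsed) can be sketched as follows, assuming the JDK's built-in non-validating DOM parser. The class and method names are hypothetical. Splitting on two or more spaces is only a heuristic: it relies on every continuation line being indented, as in the sample.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IndentationSplitDemo {

    /** Parses the XML string and returns the root element's "att" attribute value. */
    public static String attValue(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Heuristic: treat any run of two or more spaces as a former line break. */
    public static List<String> splitOnSpaceRuns(String value) {
        return Arrays.asList(value.split(" {2,}"));
    }

    public static void main(String[] args) {
        // Each newline is followed by four spaces of indentation, as in the question's sample.
        String value = attValue("<p att=\"John\n    Paul\n    Ringo\"/>");

        // The newline becomes one space and the indentation is kept: five spaces per break.
        System.out.println(value.replace(' ', '.')); // John.....Paul.....Ringo
        System.out.println(splitOnSpaceRuns(value)); // [John, Paul, Ringo]
    }
}
```

If a newline is not followed by indentation, it becomes a single space and this approach cannot distinguish it from a space that was always part of the value.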

score 1 · Answer 2 · answered Jul 02 '13 at 12:06

XSLT only sees the XML after it has been processed by the XML parser, which will have done the attribute value normalization.

I think that some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think that doing a textual replace of (\r?\n) by `&#10;` prior to parsing might be your best escape route. Newlines that are escaped in this way don't get splatted by attribute value normalization.
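A rough sketch of the textual replacement suggested above, restricted to quoted attribute values, might look like the following. It is not a robust solution: the regex is a hypothetical heuristic that assumes no text content, comment or CDATA section contains something resembling `name="value"`. The class and method names are also made up for illustration, and the `p`/`att` names come from the question's sample.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EscapeAttributeNewlines {

    // Heuristic: '=' followed by a single- or double-quoted value (which may span lines).
    private static final Pattern ATTR_VALUE =
            Pattern.compile("(=\\s*)([\"'])(.*?)\\2", Pattern.DOTALL);

    /** Replaces each newline (plus surrounding spaces/tabs) inside quoted values with &#10;. */
    public static String escapeNewlines(String rawXml) {
        Matcher m = ATTR_VALUE.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(3).replaceAll("[ \\t]*\\r?\\n[ \\t]*", "&#10;");
            String replacement = m.group(1) + m.group(2) + value + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Escapes, parses, and returns the first p element's "att" value split on newlines. */
    public static List<String> attValues(String rawXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeNewlines(rawXml))));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return Arrays.asList(p.getAttribute("att").split("\n"));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        System.out.println(attValues(raw)); // [John, Paul, Ringo]
    }
}
```

Because the newlines reach the parser as character references, a standard parser (and therefore XSLT or JAXB) sees real newline characters in the attribute value.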

Michael Kay

Thanks Michael. After doing a reasonable amount of digging, I'm coming up with blanks trying to find a Java-based parser which allows for suppression of attribute value normalisation. Textual replacement is difficult as I have no control over the XML being produced. This means that I can't limit the replacement to attribute values. – nullPainter Jul 02 '13 at 23:30

nullPainter · Accepted Answer · 2013-07-03T08:39:00.343

I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

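@Test
public void verifyNewlineEscaping() throws IOException {
    // FileUtils.readFileToString throws IOException, so the test method must declare it
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");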

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

Note that I am not using `&#10;` because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.

Note that the downside of using JSoup is that it currently converts attribute names to lowercase. There is an [open bug](https://github.com/jhy/jsoup/issues/272) detailing this. — nullPainter, Jul 03 '13 at 02:33
