I need some XSLT (or something - see below) to replace newlines in all attributes with an alternative character.

I have to process legacy XML which stores all data as attributes, and uses newlines to express cardinality (several values in a single attribute). For example:

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

These newlines are replaced with spaces when I parse the file in Java (as per the XML spec). However, I want to treat the values as a list, so this behaviour isn't particularly useful.

My 'solution' was to use XSLT to replace all newlines in all attributes with some other delimiter - but I have zero knowledge of XSLT. All examples I've seen thus far have either been very specific or have replaced node content instead of attribute values.

I have dabbled with XSLT 2.0's replace() but am having a hard time putting everything together.

Is XSLT even the correct solution? The XSLT below:

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

applied to the sample XML outputs the following using Saxon:

John Paul Ringo

Obviously this format isn't what I'm after - this is just to experiment with replace() - but have the newlines already been normalised by the time we get to XSLT processing? If so, are there any other ways to parse these values as writ using a Java parser? I've only used JAXB thus far.

nullPainter
  • I have a very nasty feeling that I may need to don my rubber gloves and implement a filthy regex on the XML string prior to parsing. Unfortunately I have no control over the XML being produced. – nullPainter Jul 02 '13 at 07:29
  • Actually no, that would be too horrid to consider. – nullPainter Jul 02 '13 at 07:35
  • If the whitespace within attribute values is semantically significant then you're not dealing with XML, and you'll need to use a non-XML tool to handle it. [Per spec](http://www.w3.org/TR/xml/#AVNormalize) all newlines within an attribute value _must_ be converted to spaces by the parser, and if you want a newline character in the value that you see after parsing then it must be escaped as a character reference (`&#10;`) – Ian Roberts Jul 02 '13 at 08:29
  • I don't disagree with you. The XML is exported from an application which will remain nameless. It's not _entirely_ the application's fault, although stuffing all data in attributes is arguably a somewhat dubious approach. I suspect the users have worked around a lack of 1:M cardinality for this particular field by using newlines, which the application blindly exported unadulterated to XML. – nullPainter Jul 02 '13 at 09:35
  • I might do some research into any Java libraries which are designed for dubious XML - this can't be an isolated instance so I'm sure somebody out there has written a deliberately loose / forgiving parser. – nullPainter Jul 02 '13 at 09:37

3 Answers

This seems hard to do. As I found in "Are line breaks in XML attribute values allowed?" (https://stackoverflow.com/a/8188290/1324394), a newline character in an attribute is valid, but the XML parser normalizes it, so it is probably lost before processing (and thus before any replacement).

Jirka Š.
  • I saw that too, but I was hoping that they'd still be there for some XSLT fix-ups. I have since found http://jdom.org/ which skirts around the problem by not claiming to be an XML parser, which presumably relieves it of having to comply with the XML spec. Going to give it a shot now... – nullPainter Jul 02 '13 at 08:06
  • Just thinking aloud, you could do something like this `replace(/data/@value, '\s{2,10}','|')` - it is not strictly correct, because it relies on each newline leaving behind more than one space, but it might do the job. – Jirka Š. Jul 02 '13 at 08:10
  • @JirkaŠ. no, that wouldn't work, because the XML parser collapses all consecutive whitespace in attribute values to a single space before the data gets as far as the XPath data model. – Ian Roberts Jul 02 '13 at 08:12
  • I was afraid of that, but I tried it in Altova and it worked. Maybe it is just Altova-specific behaviour. – Jirka Š. Jul 02 '13 at 08:17
  • Ah, I see I missed the crucial sentence in the [spec](http://www.w3.org/TR/xml/#AVNormalize): "All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA." - so if you don't have a DTD the parser will replace newlines with spaces but _won't_ collapse consecutive spaces to a single space. – Ian Roberts Jul 02 '13 at 08:27
  • Re: JDom investigation - despite an explicit indication to the contrary at http://stackoverflow.com/a/10439549/1239406, attributes are normalised by Xerces even before getting to JDom. – nullPainter Jul 02 '13 at 09:03
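
The behaviour described in the comments above can be checked with the JDK's built-in parser. The following is only a sketch: it assumes there is no DTD (so the attribute is treated as CDATA), and the class and method names (`AttributeWhitespaceDemo`, `readAtt`, `splitOnSpaceRuns`) are made up for illustration. It parses the sample from the question, shows that the newlines are gone, and then splits on runs of two or more spaces, which are what a newline plus indentation turns into.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeWhitespaceDemo {

    /** Parses the XML and returns the "att" attribute of the first p element. */
    public static String readAtt(String xml) {
        try {
            Element p = (Element) DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("p")
                    .item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    /** Splits on runs of two or more spaces (a normalised newline plus indentation). */
    public static List<String> splitOnSpaceRuns(String value) {
        return Arrays.asList(value.split(" {2,}"));
    }

    public static void main(String[] args) {
        String xml = "<sample>\n"
                + "    <p att=\"John\n"
                + "    Paul\n"
                + "    Ringo\"></p>\n"
                + "</sample>";
        String att = readAtt(xml);
        System.out.println(att.contains("\n"));    // false: the parser replaced the newlines
        System.out.println(splitOnSpaceRuns(att)); // [John, Paul, Ringo]
    }
}
```

This only works while every continuation line is indented by at least one space. A value such as `att="John` followed directly by `Paul` on the next line collapses to a single space and cannot be told apart from an ordinary space.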

XSLT only sees the XML after it has been processed by the XML parser, which will have done the attribute value normalization.

I think that some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think that doing a textual replace of (\r?\n) by &#x0A; prior to parsing might be your best escape route. Newlines that are escaped in this way don't get splatted by attribute value normalization.
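
To illustrate the textual replacement, here is a rough sketch of a pre-processing step that limits the replacement to quoted attribute values by tracking whether the scanner is inside a tag and inside quotes. The class `AttributeNewlineEscaper` and its methods are hypothetical names, not part of any library. The scanner is deliberately naive: it skips anything starting with `<!` or `<?`, but it does not understand the inside of comments, CDATA sections or DOCTYPE declarations, so markup-like text in those places would confuse it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeNewlineEscaper {

    /** Replaces \r?\n inside quoted attribute values with the reference &#x0A;. */
    public static String escapeAttributeNewlines(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        boolean inTag = false; // true between '<' and '>' of an element tag
        char quote = 0;        // the open quote character, or 0 outside a value
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            boolean hasNext = i + 1 < xml.length();
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;            // end of the attribute value
                    out.append(c);
                } else if (c == '\n') {
                    out.append("&#x0A;"); // survives attribute value normalisation
                } else if (c == '\r' && hasNext && xml.charAt(i + 1) == '\n') {
                    // drop the CR of a CRLF pair; the LF is escaped on the next pass
                } else {
                    out.append(c);
                }
            } else if (inTag) {
                if (c == '"' || c == '\'') {
                    quote = c;            // start of an attribute value
                } else if (c == '>') {
                    inTag = false;
                }
                out.append(c);
            } else {
                // "<!" and "<?" start comments, CDATA, DOCTYPE or PIs, not element tags
                if (c == '<' && hasNext && xml.charAt(i + 1) != '!' && xml.charAt(i + 1) != '?') {
                    inTag = true;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses the XML and returns the "att" attribute of the first p element. */
    public static String readAtt(String xml) {
        try {
            Element p = (Element) DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("p")
                    .item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        String att = readAtt(escapeAttributeNewlines(xml));
        // The newlines are still present after parsing, so the value can be split on them
        for (String name : att.split("\\s*\\n\\s*")) {
            System.out.println(name);
        }
    }
}
```

Newlines in element content and between attributes are left alone, since replacing those with a character reference would either change the text or break the markup.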

Michael Kay
  • Thanks Michael. After doing a reasonable amount of digging, I'm coming up with blanks trying to find a Java-based parser which allows for suppression of attribute value normalisation. Textual replacement is difficult as I have no control over the XML being produced. This means that I can't limit the replacement to attribute values. – nullPainter Jul 02 '13 at 23:30

I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

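import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils; // or org.apache.commons.lang.StringUtils with Commons Lang 2
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.junit.Test;

// sourcePath is assumed to be a java.nio.file.Path field pointing at the legacy XML file

@Test
public void verifyNewlineEscaping() throws IOException {
    // Parse with JSoup's XML parser rather than a conforming XML parser
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");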

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

Note that I am not using &#10; because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.
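
For completeness, here is a minimal sketch of how the cleansed output might then be read back with a standard parser and turned into a list. The class `DelimitedAttributeReader` and its `readList` method are hypothetical, and the sketch assumes that `|` never occurs in a genuine value. Note that `|` is a regex metacharacter, so it has to be quoted before being passed to `String.split`.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DelimitedAttributeReader {

    /** Reads one attribute of the first matching element and splits it on '|'. */
    public static List<String> readList(String xml, String elementName, String attributeName) {
        try {
            Element element = (Element) DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName(elementName)
                    .item(0);
            String value = element.getAttribute(attributeName);
            // Pattern.quote stops '|' from being read as regex alternation
            return Arrays.asList(value.split(Pattern.quote("|")));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the cleansed XML", e);
        }
    }

    public static void main(String[] args) {
        String cleaned = "<sample>\n    <p att=\"John|Paul|Ringo\"></p>\n</sample>";
        System.out.println(readList(cleaned, "p", "att")); // [John, Paul, Ringo]
    }
}
```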

nullPainter
  • Note that the downside of using JSoup is that it currently converts attribute names to lowercase. There is an [open bug](https://github.com/jhy/jsoup/issues/272) detailing this. – nullPainter Jul 03 '13 at 02:33