Parse XML escaped in CDATA mixed with invalid HTML

Question

I have the below element in a web service response. As you can see, it's escaped XML dumped as CDATA, so the XML parser just looks at it as a string and I'm unable to get the data I need from it through the usual means of XSLT and XPath. I need to turn this ugly string back into XML so that I can read it properly.

I have tried a search-and-replace that simply converts all `&lt;` to `<` and `&gt;` to `>`, and this works great, but there is a problem: the message.body element can actually contain HTML which is not valid XML. It might not even be valid HTML for all I know. So if I just replace everything, this will probably crash when I try to turn the string back into an XML document.

How can I unescape this safely? Is there a good way to do the replacement in the whole string except between the message.body open and closing tags for example?

<output>&lt;item type="object"&gt;
  &lt;ticket.id type="string"&gt;171&lt;/ticket.id&gt;
  &lt;ticket.title type="string"&gt;SoapUI Test&lt;/ticket.title&gt;
  &lt;ticket.created_at type="string"&gt;2013-12-03 12:50:54&lt;/ticket.created_at&gt;
  &lt;ticket.status type="string"&gt;Open&lt;/ticket.status&gt;
  &lt;updated type="string"&gt;false&lt;/updated&gt;
  &lt;message type="object"&gt;
    &lt;message.id type="string"&gt;520&lt;/message.id&gt;
    &lt;message.created_at type="string"&gt;2013-12-03 12:50:54.000&lt;/message.created_at&gt;
    &lt;message.author type="string"/&gt;
    &lt;message.body type="string"&gt;Just a test message...&lt;/message.body&gt;
  &lt;/message&gt;
  &lt;message type="object"&gt;
    &lt;message.id type="string"&gt;521&lt;/message.id&gt;
    &lt;message.created_at type="string"&gt;2013-12-03 13:58:32.000&lt;/message.created_at&gt;
    &lt;message.author type="string"/&gt;
    &lt;message.body type="string"&gt;Another message!&lt;/message.body&gt;
  &lt;/message&gt;
&lt;/item&gt;
</output>

score 0 · Answer 1 · answered Dec 10 '13 at 11:40

This is actually lifted from the project I'm working on right now.

    // Parses a string holding a well-formed XML document and returns its root element.
    // Reading through a StringReader (java.io.StringReader, org.xml.sax.InputSource)
    // avoids the platform-default charset that getBytes() would use.
    private Node stringToNode(String textContent) {
        Element node = null;
        try {
            node = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(textContent)))
                    .getDocumentElement();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Returns null on failure; callers must check for it.
            logger.error(e.getMessage(), e);
        }
        return node;
    }

This will give you the root element of the document parsed from the string. I use this to get it back into the original document:

if (textContent.contains(XML_HEADER)) {
  textContent = textContent.substring(textContent.indexOf(XML_HEADER) + XML_HEADER.length());
}
Node newNode = stringToNode(textContent);
if (newNode != null) {
  Node importedNode = soapBody.getOwnerDocument().importNode(newNode, true);
  nextChild.setTextContent(null);
  nextChild.appendChild(importedNode);
}

M21B8


Note that this can only work if the text you're parsing is a well formed XML document with a single root-level element. If it might be a document _fragment_ with more than one element at the root level then you will need to use `org.w3c.dom.ls.LSParser` instead of `DocumentBuilder`, in particular the `parseWithContext` method can parse a fragment and insert the result directly into an existing DOM tree. – Ian Roberts Dec 10 '13 at 11:59
@IanRoberts Do you have an example using `org.w3c.dom.ls.LSParser`? – Svish Dec 10 '13 at 12:09
I'd be very interested to see that as well. It would have saved me some trouble! – M21B8 Dec 10 '13 at 12:28
@Svish There's an example in [this answer](http://stackoverflow.com/a/20357496/592139) I gave to a similar question last week. Like so many things in the W3C DOM it's a bit cumbersome (that's what you get from a language-independent API designed by committee) but once you get your head around it it does the job well. – Ian Roberts Dec 10 '13 at 12:39
@IanRoberts Tried it out but getting a `NOT_SUPPORTED_ERR` when I try to call the `parseWithContext` method. Any idea why? – Svish Dec 10 '13 at 15:56
@Svish the Javadoc suggests that NOT_SUPPORTED_ERR is "raised if the LSParser doesn't support this method, or if the context node is of type Document and the DOM implementation doesn't support the replacement of the DocumentType child or Element child" - what kind of node is your "context"? – Ian Roberts Dec 10 '13 at 16:01

score 0 · Accepted Answer · answered Dec 12 '13 at 13:57

This is my current solution. You give it an XPath for the nodes that are messed up and a set of element names that might include messed up HTML and other problems. It works roughly as follows:

1. Pull out the text content of the nodes matched by the XPath.
2. Run a regex to wrap problematic child elements in CDATA.
3. Wrap the text in a temporary element (otherwise it crashes if there are multiple root nodes).
4. Parse the text back to DOM.
5. Add the child nodes of the temporary node back in place of the previous text content.

The regex solution in step 2 is probably not fool-proof, but I don't really see a better solution at the moment. If you do, let me know!
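
To try steps 2 to 4 in isolation, here is a minimal, self-contained sketch that uses only the JDK. The names `CdataWrapDemo`, `wrapInCdata` and `parseFragment` are made up for illustration and are not part of the classes below. The `(?<!/)` lookbehind, which skips self-closing tags such as `<message.author type="string"/>`, is an addition of this sketch and not part of the original regex. It assumes Java 9+ for `Set.of`, and it still breaks if the wrapped content itself contains `]]>`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Set;
import java.util.UUID;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the answer's code.
class CdataWrapDemo {

    // Step 2: wrap the content of every excluded element in a CDATA section.
    // The lookbehind (?<!/) keeps self-closing tags from being treated as open tags.
    static String wrapInCdata(String text, Set<String> excludes) {
        for (String name : excludes) {
            String quoted = Pattern.quote(name);
            String regex = "(?s)(<" + quoted + "\\b[^>]*(?<!/)>)(.*?)(</" + quoted + ">)";
            text = text.replaceAll(regex, "$1<![CDATA[$2]]>$3");
        }
        return text;
    }

    // Steps 3 and 4: wrap the text in a temporary root element and parse it.
    // The returned element is the temporary root; its children are the fragment.
    static Element parseFragment(String text, Set<String> excludes) {
        String tmp = "tmp_" + UUID.randomUUID();
        String wrapped = "<" + tmp + ">" + wrapInCdata(text, excludes) + "</" + tmp + ">";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Failed to parse fragment: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Two root-level elements; the body holds HTML that is not well-formed XML.
        String text = "<message.id type=\"string\">521</message.id>"
                + "<message.body type=\"string\">Hi<br>there & <b>bye</message.body>";
        Element root = parseFragment(text, Set.of("message.body"));
        // The body survives as plain text instead of breaking the parser.
        System.out.println(root.getLastChild().getTextContent());
    }
}
```

Without the CDATA wrapping, the same input would make the parser throw a `SAXException` because of the unclosed `<br>` and the bare `&`.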

CDataFixer

import java.util.*;
import java.util.regex.Pattern; // needed for Pattern.quote; not covered by java.util.*
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class CDataFixer
{
    private final XmlHelper xml = XmlHelper.getInstance();

    public Document fix(Document document, String nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
    {
        return fix(document, xml.newXPath().compile(nodesToFix), excludes);
    }

    private Document fix(Document document, XPathExpression nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
    {
        Document wc = xml.copy(document); 

        NodeList nodes = (NodeList) nodesToFix.evaluate(wc, XPathConstants.NODESET);
        int nodeCount = nodes.getLength();

        for(int n=0; n<nodeCount; n++)
            parse(nodes.item(n), excludes);

        return wc;
    }

    private void parse(Node node, Set<String> excludes) throws XmlException
    {
        String text = node.getTextContent();

        for(String exclude : excludes)
        {
            String regex = String.format("(?s)(<%1$s\\b[^>]*>)(.*?)(</%1$s>)", Pattern.quote(exclude));
            text = text.replaceAll(regex, "$1<![CDATA[$2]]>$3");
        }

        String randomNode = "tmp_"+UUID.randomUUID().toString();

        text = String.format("<%1$s>%2$s</%1$s>", randomNode, text);

        NodeList parsed = xml
            .parse(text)
            .getFirstChild()
            .getChildNodes();

        node.setTextContent(null);
        for(int n=0; n<parsed.getLength(); n++)
            node.appendChild(node.getOwnerDocument().importNode(parsed.item(n), true));
    }
}

XmlHelper

import java.io.*;    
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;
import javax.xml.xpath.*;    
import org.w3c.dom.*;
import org.xml.sax.*;

public final class XmlHelper
{
    private static final XmlHelper instance = new XmlHelper(); 
    public static XmlHelper getInstance()
    {
        return instance;
    }


    private final SAXTransformerFactory transformerFactory;
    private final DocumentBuilderFactory documentBuilderFactory;
    private final XPathFactory xpathFactory;

    private XmlHelper()
    {
        documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);

        xpathFactory = XPathFactory.newInstance();

        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE))
            throw new RuntimeException("Failed to create SAX-compatible TransformerFactory.");
        transformerFactory = (SAXTransformerFactory) tf;
    }

    public DocumentBuilder newDocumentBuilder()
    {
        try
        {
            return documentBuilderFactory.newDocumentBuilder();
        }
        catch (ParserConfigurationException e)
        {
            throw new RuntimeException("Failed to create new "+DocumentBuilder.class, e);
        }
    }

    public XPath newXPath()
    {
        return xpathFactory.newXPath();
    }

    public Transformer newIdentityTransformer(boolean omitXmlDeclaration, boolean indent)
    {
        try
        {
            Transformer transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitXmlDeclaration ? "yes" : "no");
            return transformer;
        }
        catch (TransformerConfigurationException e)
        {
            throw new RuntimeException("Failed to create Transformer instance: "+e.getMessage(), e);
        }
    }

    public Templates newTemplates(String xslt) throws XmlException
    {
        try
        {
            return transformerFactory.newTemplates(new DOMSource(parse(xslt)));
        }
        catch (TransformerConfigurationException e)
        {
            throw new RuntimeException("Failed to create templates: "+e.getMessage(), e);
        }
    }

    public Document parse(String xml) throws XmlException
    {
        return parse(new InputSource(new StringReader(xml)));
    }

    public Document parse(InputSource xml) throws XmlException
    {
        try
        {
            return newDocumentBuilder().parse(xml);
        }
        catch (SAXException e)
        {
            throw new XmlException("Failed to parse xml: "+e.getMessage(), e);
        }
        catch (IOException e)
        {
            throw new XmlException("Failed to read xml: "+e.getMessage(), e);
        }
    }

    public String toString(Node node)
    {
        return toString(node, true, false);
    }

    public String toString(Node node, boolean omitXMLDeclaration, boolean indent)
    {
        try
        {
            StringWriter writer = new StringWriter();

            newIdentityTransformer(omitXMLDeclaration, indent)
                .transform(new DOMSource(node), new StreamResult(writer));

            return writer.toString();
        }
        catch (TransformerException e)
        {
            throw new RuntimeException("Failed to transform XML into string: " + e.getMessage(), e);
        }
    }

    public Document copy(Document document)
    {
        DOMSource source = new DOMSource(document);
        DOMResult result = new DOMResult();

        try
        {
            newIdentityTransformer(true, false)
                .transform(source, result);
            return (Document) result.getNode();
        }
        catch (TransformerException e)
        {
            throw new RuntimeException("Failed to copy XML: " + e.getMessage(), e);
        }
    }
}
