Java DOM transforming and parsing arbitrary strings with invalid XML characters?

Question

First of all I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) XML file but rather a given arbitrary Java String which may or may not contain an invalid XML character. I want to create a DOM Document containing a Text node with the given String, then transform it to a file. When the file is parsed to a DOM Document I want to get a String which is equal to the initial given String. I create the Text node with org.w3c.dom.Document#createTextNode(String data) and I get the String with org.w3c.dom.Node#getTextContent().

As you can see in https://stackoverflow.com/a/28152666/3882565 there are some invalid characters for Text nodes in an XML file. Actually there are two different types of "invalid" characters for Text nodes. There are the characters covered by predefined entities, such as ", &, ', < and >, which the DOM API automatically escapes as &quot;, &amp;, &apos;, &lt; and &gt; in the resulting file and unescapes again when the file is parsed. The problem is that this is not the case for other invalid characters such as '\u0000' or '\uffff'. An exception occurs when parsing the file because '\u0000' and '\uffff' are invalid characters.

Probably I have to implement a method which escapes those characters in the given String in a unique way before submitting it to the DOM API and undo that later when I get the String back, right? Is there a better way to do this? Did someone implement those or similar methods in the past?

Edit: This question was marked as duplicate of Best way to encode text data for XML in Java?. I have now read all of the answers but none of them solves my problem. All of the answers suggest:

Using an XML library such as the DOM API, which I already do; none of those libraries actually escapes invalid characters other than ", &, ', <, > and a few more.
Replacing all invalid characters by "&#number;", which results in an exception for invalid characters such as "&#0;" when parsing the file.
Using a third-party library with an XML encode method; these do not support illegal characters such as "&#0;" (they are skipped in some libraries).
Using a CDATA section, which doesn't support invalid characters either.

Why do you need any other characters escaped? Can you demonstrate that characters other than quotes, ampersands, less-than and greater-than are not coming back unaltered? — VGR, Dec 22 '19 at 21:20
@VGR yes, a ```Document``` containing a ```Text``` node created with ```String.valueOf('\uffff')``` can be transformed to a file but results in an exception when the file is parsed. — stonar96, Dec 22 '19 at 21:38
This question should be reopened (see: edit in the question and the latest comment) — stonar96, Dec 22 '19 at 21:44
`\uffff` is [not a valid character](http://www.fileformat.info/info/unicode/char/ffff/). Is this really text data, or is it bytes? Bytes should not be stored as text. — VGR, Dec 22 '19 at 21:57
@VGR ```'\uffff'``` is just one example, ```'\u0000'``` also does not work. As I have stated in my question, I need a method to store an arbitrary ```String``` in the XML file. — stonar96, Dec 22 '19 at 22:04
Control characters from `\u0000` to `\u001f` are also invalid (except for tab, CR and LF), [according to the XML specification](https://www.w3.org/TR/xml/#charsets). I ask again, are you sure this is really supposed to be text? — VGR, Dec 22 '19 at 22:08
@VGR I know the XML specification and I have also said that in my question that these characters are invalid. I don't know for which ```String```s the end user uses the API but it should work for an arbitrary ```String```. — stonar96, Dec 22 '19 at 22:15
Ah, so you did. I know this isn’t much of an answer, but I would assume that the data is not really text data at all, so I would store it in a binary format like [base64Binary](https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#base64Binary). If you intend to transform the content itself, then you will indeed need to come up with some escape-like mechanism of representing those invalid characters. — VGR, Dec 22 '19 at 22:20
***This material has been covered thoroughly multiple times.*** Either strip the control et. al. characters that are illegal, or encode them in some manner such as Base64. There are libraries available to help you. Sorry, but there's just nothing unique about your question. Additional duplicates added. There are many more. — kjhughes, Dec 22 '19 at 23:36
If it must be XML, I'll toss out the idea to maybe use a `CDATA` section, perhaps with the encoded Base64. Seems like the closest fit from the XML spec. — markspace, Jan 01 '20 at 19:33
@markspace a ```CDATA``` section alone wouldn't solve the problem at all. Invalid XML characters (Unicode code points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) are also invalid in a ```CDATA``` section. In combination with Base64 encoding it will work, true. Thanks for suggesting the ```CDATA``` section. — stonar96, Jan 01 '20 at 19:43

score 1 · Answer 1 · edited Jan 02 '20 at 19:14

One technique is to encode the whole string as Base64-encoded-UTF8.

But if the "special" characters are rare, that's a significant sacrifice in readability and file size.
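
A minimal sketch of the Base64 approach, assuming Java 8 or later. The class and method names are hypothetical. The encoded value is what would be passed to createTextNode, and the decoder is applied to the result of getTextContent(). One caveat: String.getBytes replaces unpaired surrogates with a substitute byte, so this variant round-trips '\u0000' and '\uffff' but not a String containing a lone surrogate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; the names are illustrative and not part of any library. */
final class Base64Text {

    /** Encodes the string as UTF-8 bytes and returns their Base64 form (plain ASCII, always valid XML text). */
    static String encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    /** Reverses encode(String); surrounding whitespace from pretty-printed XML is trimmed first. */
    static String decode(String textContent) {
        byte[] utf8 = Base64.getDecoder().decode(textContent.trim());
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Typical use would be document.createTextNode(Base64Text.encode(string)) when writing and Base64Text.decode(element.getTextContent()) when reading.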

Another technique is to represent special characters as processing instructions, for example <?U 0000?> for codepoint 0.
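
A rough sketch of the processing-instruction idea with the DOM API. The class name, method names and the four-digit hexadecimal data format are assumptions; only the target name U comes from the example above. Note that getTextContent() on the element skips processing instructions, so reading has to walk the child nodes instead.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

/** Hypothetical helper: stores invalid XML 1.0 code points as processing instructions with target U. */
final class PiText {
    static final String TARGET = "U";

    /** The Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Appends the value to the element as Text nodes interleaved with processing instructions. */
    static void write(Element parent, String value) {
        Document doc = parent.getOwnerDocument();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXmlChar(cp)) {
                run.appendCodePoint(cp);
                continue;
            }
            // Flush the valid text collected so far, then emit the invalid code point as a PI.
            if (run.length() > 0) {
                parent.appendChild(doc.createTextNode(run.toString()));
                run.setLength(0);
            }
            parent.appendChild(doc.createProcessingInstruction(TARGET, String.format("%04X", cp)));
        }
        if (run.length() > 0) {
            parent.appendChild(doc.createTextNode(run.toString()));
        }
    }

    /** Reassembles the original string from the element's direct children. */
    static String read(Element parent) {
        StringBuilder out = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction
                    && TARGET.equals(((ProcessingInstruction) n).getTarget())) {
                String hex = ((ProcessingInstruction) n).getData().trim();
                out.appendCodePoint(Integer.parseInt(hex, 16));
            } else if (n.getNodeType() == Node.TEXT_NODE
                    || n.getNodeType() == Node.CDATA_SECTION_NODE) {
                out.append(n.getNodeValue());
            }
        }
        return out.toString();
    }
}
```

The file stays readable because ordinary text is stored unchanged; only the rare invalid code points turn into markup.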

Another would be to use backslash escaping, for example \u0000 for codepoint 0, and of course \\ for backslash itself. This has the advantage that you can probably find existing library routines that do this for you (for example JSON conversion libraries). I can't imagine why your requirements say you can't use such libraries; but if you really can't, then it's not hard to write the code yourself.
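
If you do write it yourself, a hand-rolled version might look roughly like the sketch below. The class and method names are hypothetical, and the choice of exactly four lowercase hex digits is an assumption borrowed from Java and JSON conventions. It relies on the fact that every code point that is invalid in XML 1.0 lies in the Basic Multilingual Plane.

```java
/** Hypothetical helper; the names are illustrative only. */
final class BackslashEscaper {

    /** The Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Doubles each backslash and writes invalid code points as backslash, 'u', four hex digits. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                sb.append("\\\\");
            } else if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Invalid code points are all below 0x10000, so four hex digits suffice.
                sb.append(String.format("\\u%04x", cp));
            }
        }
        return sb.toString();
    }

    /** Reverses escape(String); throws IllegalArgumentException on malformed input. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                sb.append(c);
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("dangling backslash at index " + i);
            }
            char next = s.charAt(++i);
            if (next == '\\') {
                sb.append('\\');
            } else if (next == 'u' && i + 4 < s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4; // skip the four hex digits just consumed
            } else {
                throw new IllegalArgumentException("bad escape at index " + (i - 1));
            }
        }
        return sb.toString();
    }
}
```

As with the other techniques, escape is applied before createTextNode and unescape after getTextContent(); the DOM API still takes care of &, < and >.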

edited Jan 02 '20 at 19:14

Daniel Haley


answered Jan 01 '20 at 22:10

Michael Kay


Thanks for the answer. I didn't mean that I can't use a further library but I can't use another library instead of the DOM API. The last part of your answer is basically what I do in my answer, except that I use ```#``` instead of ```\```. Can you show me an example how to use such a library or can you recommend a specific library for that? – stonar96 Jan 01 '20 at 22:24
`StringEscapeUtils` in Apache Commons is often recommended. I don't use it myself because we have our own routines in Saxon. – Michael Kay Jan 01 '20 at 22:46
```StringEscapeUtils``` doesn't look promising for me. It has escape and unescape methods for some specific languages but I can't escape specific characters in the way I want with that. – stonar96 Jan 01 '20 at 23:10
how did you use the StringEscapeUtils? StringEscapeUtils.escapeXml11(...) and StringEscapeUtils.unescapeXml(...) to get the original values back? – mahieus Jan 02 '20 at 17:06
@mahieus these are not the functions I am looking for. The DOM API already does this and this function will simple remove invalid characters which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. – stonar96 Jan 02 '20 at 18:16

Mordechai · Answer 2 · 2020-01-02T21:24:04.463

1

I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom) by starting the document with this XML declaration:

<?xml version="1.1" encoding="UTF-8" standalone="yes"?>

According to Wikipedia the only invalid characters in XML 1.1 are U+0000, surrogates, U+FFFE and U+FFFF
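
For illustration, this is roughly how an XML 1.1 document could be produced and read back with the JDK's DOM and Transformer APIs. The class and method names are hypothetical. The sketch assumes that the built-in serializer writes control characters as numeric character references such as &#1;, which XML 1.1 requires for them; that behaviour is worth verifying on your JDK. U+0000 remains invalid even in XML 1.1.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical demo class; the names are illustrative only. */
final class Xml11RoundTrip {

    /** Wraps the text in a single element and serializes the document as XML 1.1. */
    static String serialize(String text) throws Exception {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        document.setXmlVersion("1.1");
        Element element = document.createElement("element");
        element.appendChild(document.createTextNode(text));
        document.appendChild(element);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Ask the serializer to declare version 1.1 in the XML declaration.
        transformer.setOutputProperty(OutputKeys.VERSION, "1.1");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    /** Parses the XML and returns the text content of the document element. */
    static String parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }
}
```

With this setup, a string such as "a" + '\u0001' + "b" should survive the round trip, whereas the same string fails to parse when the document is declared as XML 1.0.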

This code snippet ensures you always get a correct XML 1.1 string, omitting illegal chars (which might not be what you are looking for, though, if you need the exact same string back):

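public static String escape(String orig) {
    StringBuilder builder = new StringBuilder();

    // Iterate over code points so that valid surrogate pairs are kept intact;
    // only unpaired surrogates show up as values in the range 0xd800..0xdfff.
    for (int i = 0; i < orig.length(); ) {
        int c = orig.codePointAt(i);
        i += Character.charCount(c);

        if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
            // Not allowed in XML 1.1, not even as a character reference: omit.
            continue;
        } else if (c == '\'') {
            builder.append("&apos;");
        } else if (c == '"') {
            builder.append("&quot;");
        } else if (c == '&') {
            builder.append("&amp;");
        } else if (c == '<') {
            builder.append("&lt;");
        } else if (c == '>') {
            builder.append("&gt;");
        } else if (c <= 0x1f || (c >= 0x7f && c <= 0x9f)) {
            // Control characters are written as numeric character references,
            // which XML 1.1 requires for its restricted characters.
            builder.append("&#").append(c).append(';');
        } else {
            builder.appendCodePoint(c);
        }
    }

    return builder.toString();
}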

edited Jan 02 '20 at 21:24

answered Jan 02 '20 at 19:42

Mordechai


True, XML 1.1 already supports a few more characters than XML 1.0 as described in this answer https://stackoverflow.com/a/28152666/3882565, however, it still doesn't support an arbitrary ```String```. Also note that there are a few more invalid characters as listed in the linked answer. – stonar96 Jan 02 '20 at 19:59
Control characters are a real practical problem when storing arbitrary strings in XML, which are legal in 1.1; those 3 illegal chars are far too rare. – Mordechai Jan 02 '20 at 20:16
Sorry, I left out the surrogates range. It is in fact more than 3 chars, but still ultra rare. – Mordechai Jan 02 '20 at 21:19
The 0xFFFE and 0xFFFF points are [guaranteed to never be a Unicode character](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#ref_UFFF0_black). – Mordechai Jan 02 '20 at 21:28
The added function probably works fine but it's not really what I am looking for. The characters ```"```, ```&```, ```'```, ```<```, ```>``` and a few more are already escaped and unescaped by the DOM API. In my answer I have used an alternative escaping and unescaping system for the invalid characters. Your code does basically the same as [StringEscapeUtils.escapeXml11(String input)](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeXml11-java.lang.String-) – stonar96 Jan 02 '20 at 21:36

stonar96 · Accepted Answer · 2023-02-24T13:02:55.157

As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.

    String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element element = document.createElement("element");
    element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
    document.appendChild(element);
    TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
    // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
    document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
    System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
    // prints true

escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points that are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 *
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static final String escapeInvalidXmlCharacters(String string) {
    if (string == null) {
        throw new IllegalArgumentException("string cannot be null");
    }

    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }

    return stringBuilder.toString();
}

/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Reverses the effect of <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
 *
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static final String unescapeInvalidXmlCharacters(String string) {
    if (string == null) {
        throw new IllegalArgumentException("string cannot be null");
    }

    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;

            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);

                if (codePoint == ';') {
                    escaped = true;
                    break;
                }

                if (codePoint >= 48 && codePoint <= 57) {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }

            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }

    return stringBuilder.toString();
}

Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
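
One possible simplification, offered only as a sketch: since the escaped text contains '#' only in the forms "##" and "#digits;", the unescaping can be done with a single regular expression. The class and method names below are hypothetical, and unlike the functions above this variant throws on malformed escapes (for example a number too large to parse) instead of passing them through.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical compact variant of the '#'-based escaping; the names are illustrative only. */
final class HashEscaper {
    // Matches either a doubled '#' or an escaped code point of the form "#<decimal digits>;".
    private static final Pattern ESCAPE = Pattern.compile("##|#(\\d+);");

    /** The Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        s.codePoints().forEach(cp -> {
            if (cp == '#') {
                sb.append("##");
            } else if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('#').append(cp).append(';'); // decimal code point, as in the answer
            }
        });
        return sb.toString();
    }

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder sb = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start()); // copy the unescaped text before the match
            if (m.group(1) == null) {
                sb.append('#'); // "##" stands for a literal '#'
            } else {
                sb.appendCodePoint(Integer.parseInt(m.group(1)));
            }
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }
}
```

The matcher scans left to right, so a sequence like "###0;" is read as "##" followed by "#0;", which matches what the escaping function produces for a '#' followed by '\u0000'.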
