Java RegEx matcher breaks characters outside the BMP

Question

I'm currently writing a utility class to sanitize input that is saved to an XML document. For us, sanitizing means that all illegal characters (https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0) are simply removed from the string.

I tried to do this with a regex that replaces all invalid characters with an empty string, but for Unicode characters outside the BMP this seems to break the encoding somehow, leaving me with those ? characters. It also does not seem to matter which way of replacing by regex I use (String#replaceAll(String, String), Pattern#compile(String), org.apache.commons.lang3.RegExUtils#removeAll(String, String)).

Here's an example implementation with a test (in Spock) that shows the problem:

XmlStringUtil.java

package com.example.util;

import lombok.NonNull;

import java.util.regex.Pattern;

public class XmlStringUtil {

    private static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
    );

    public static String sanitizeXml10(@NonNull String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

}

XmlStringUtilSpec.groovy

package com.example.util

import spock.lang.Specification

class XmlStringUtilSpec extends Specification {

    def 'sanitize string values for xml version 1.0'() {
        when: 'a string is sanitized'
            def sanitizedString = XmlStringUtil.sanitizeXml10 inputString

        then: 'the returned sanitized string matches the expected one'
            sanitizedString == expectedSanitizedString

        where:
            inputString                                | expectedSanitizedString
            ''                                         | ''
            '\b'                                       | ''
            '\u0001'                                   | ''
            'Hello World!\0'                           | 'Hello World!'
            'text with emoji \uD83E\uDDD1\uD83C\uDFFB' | 'text with emoji \uD83E\uDDD1\uD83C\uDFFB'
    }

}

I now have a solution where I rebuild the whole string from its individual code points, but that does not seem to be the right approach.

Thanks in advance!

According to [this](https://stackoverflow.com/questions/26823484/replacing-emoji-unicode-range-from-arabic-tweets-using-java/26838867#26838867) and [this](https://stackoverflow.com/questions/26897810/using-java-regexes-to-match-a-range-of-unicode-code-points-outside-the-bmp-it) a regex should work with "outside" characters. Are you sure it's not just a font problem? — Sascha, May 23 '19 at 13:50
I thought so too (at first), but the test says something different. It fails for the last entry of the where block. — Max N., May 23 '19 at 14:40
These emojis fall into the forbidden range between D7FF and E000 and shouldn't get through at all. — Sascha, May 23 '19 at 14:59
If I insert spaces around them, the result is only the spaces. So Java interprets the string differently when they are written together. Even printing the code points of ``"\uD83E\uDDD1\uD83C\uDFFB"`` shows ``0x1f9d1`` and ``0x1f3fb``. — Sascha, May 23 '19 at 15:22
The emojis are valid XML. As you stated in the second post, they are in the allowed range 0x10000 to 0x10FFFF. This is also stated in the Wikipedia article. — Max N., May 23 '19 at 15:58
I read about this Unicode surrogate pair business and stand corrected. That led to the working regex solution in my second answer. — Sascha, May 24 '19 at 08:40
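
The surrogate pair point from the comments can be checked directly. The following is a minimal sketch, not part of the original thread; the class name `SurrogatePairDemo` and the helper `decode` are made up for illustration. It shows that the four UTF-16 chars in the test string form two code points, both inside the allowed range 0x10000 to 0x10FFFF.

```java
final class SurrogatePairDemo {

    // Hypothetical helper: combines a high and a low surrogate into one code point.
    static int decode(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // The emoji sequence from the question's test case.
        String emoji = "\uD83E\uDDD1\uD83C\uDFFB";

        System.out.println(emoji.length());                          // 4 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 2 code points

        // Prints 1f9d1 and 1f3fb, the values quoted in the comments.
        emoji.codePoints()
             .mapToObj(Integer::toHexString)
             .forEach(System.out::println);

        // Taken individually, each char is a surrogate (0xD800..0xDFFF);
        // only the decoded pair is a valid XML 1.0 character.
        System.out.println(Character.isSurrogate(emoji.charAt(0)));
        System.out.println(Character.isSupplementaryCodePoint(decode(emoji.charAt(0), emoji.charAt(1))));
    }
}
```

This is why the individual `\uD83E` and `\uDDD1` values look forbidden while the characters they encode are not.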

score 1 · Answer 1 · answered May 23 '19 at 14:02

A solution without regex could be a filtered code point stream:

public static String sanitize_xml_10(String input) {
    // Test is assumed to be the class that declares allowedXml10 below.
    return input.codePoints()
            .filter(Test::allowedXml10)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();
}

private static boolean allowedXml10(int codepoint) {
    if(0x0009==codepoint) return true;
    if(0x000A==codepoint) return true;
    if(0x000D==codepoint) return true;
    if(0x0020<=codepoint && codepoint<=0xD7FF) return true;
    if(0xE000<=codepoint && codepoint<=0xFFFD) return true;
    if(0x10000<=codepoint && codepoint<=0x10FFFF) return true;
    return false;
}

Yeah, that's basically what I did, and it works. But it feels strange that it is not possible to do this at the String level already. — Max N., May 23 '19 at 14:36
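
For anyone who wants to run the code point approach against the inputs from the question, here is a self-contained sketch. The class name `CodePointSanitizer` and the method name `sanitize` are placeholders, not from the answer. It uses the same ranges and also shows what happens to an unpaired surrogate.

```java
final class CodePointSanitizer {

    // Same ranges as in the answer: the XML 1.0 valid characters.
    static boolean allowedXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Rebuilds the string from the code points that pass the filter.
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(CodePointSanitizer::allowedXml10)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = "text with emoji \uD83E\uDDD1\uD83C\uDFFB";

        System.out.println(sanitize("Hello World!\0").equals("Hello World!"));
        System.out.println(sanitize("\b").isEmpty());
        System.out.println(sanitize(emoji).equals(emoji));

        // String.codePoints() reports an unpaired surrogate as its own value,
        // which falls in the excluded 0xD800..0xDFFF gap and is dropped.
        System.out.println(sanitize("a\uD83Eb").equals("ab"));
    }
}
```

Because `String.codePoints()` already merges valid surrogate pairs, the filter never sees the halves of an emoji separately.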

score 1 · Accepted Answer · answered May 24 '19 at 08:39

After some reading and experimenting, a slight change to the regex (replacing the \x{..} with the surrogates \u...\u...) works:

private static final Pattern XML_10_PATTERN = Pattern.compile(
    "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
);

Check:

sanitizeXml10("\uD83E\uDDD1\uD83C\uDFFB").codePoints().mapToObj(Integer::toHexString).forEach(System.out::println);

results in

1f9d1
1f3fb
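
To re-run the Spock table from the question in plain Java, something like the sketch below could be used. The class name `Xml10RegexCheck` is a placeholder; the pattern string is copied from this answer.

```java
import java.util.regex.Pattern;

final class Xml10RegexCheck {

    // BMP ranges are regex-level escapes (double backslash); the supplementary
    // range U+10000..U+10FFFF is written as literal surrogate pairs.
    static final Pattern XML_10_PATTERN = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");

    static String sanitizeXml10(String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Input and expected output pairs, taken from the question's where block.
        String[][] cases = {
            {"", ""},
            {"\b", ""},
            {"\u0001", ""},
            {"Hello World!\0", "Hello World!"},
            {"text with emoji \uD83E\uDDD1\uD83C\uDFFB", "text with emoji \uD83E\uDDD1\uD83C\uDFFB"},
        };
        for (String[] c : cases) {
            boolean ok = sanitizeXml10(c[0]).equals(c[1]);
            System.out.println(ok ? "ok" : "MISMATCH");
        }
    }
}
```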

Answered by Sascha

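Ok, this was basically it. Strangely, I had to remove all escapes from the _character point groups_ but was not allowed to remove them from the single character points. So the working RegEx, written here as a Java string literal with the range endpoints as Unicode escapes, is `"[^\\u0009\\u000A\\u000D\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"` – Max N. May 24 '19 at 10:14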
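
A likely reason the escapes on the single code points have to stay: there are two escape levels in play. A single-backslash Unicode escape is replaced by the Java compiler before the string literal is even parsed, so a compiler-level escape for a line terminator (U+000A or U+000D) would end the literal mid-line and fail to compile. A double-backslash escape reaches `Pattern` as text and is decoded by the regex engine. The sketch below illustrates the difference with U+0020; the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

final class EscapeLevelsDemo {

    // Compiler-level escape: javac substitutes the character first,
    // so this literal holds exactly one char, a space.
    static final String COMPILER_LEVEL = "\u0020";

    // Regex-level escape: the literal holds six chars (backslash, u, 0, 0, 2, 0)
    // and java.util.regex.Pattern decodes them when the pattern is compiled.
    static final String REGEX_LEVEL = "\\u0020";

    public static void main(String[] args) {
        System.out.println(COMPILER_LEVEL.length()); // 1
        System.out.println(REGEX_LEVEL.length());    // 6

        // Both forms match a single space when used as a pattern.
        System.out.println(Pattern.matches(COMPILER_LEVEL, " ")); // true
        System.out.println(Pattern.matches(REGEX_LEVEL, " "));    // true
    }
}
```

For the supplementary range, the compiler-level form puts the real characters U+10000 and U+10FFFF into the pattern string, which `Pattern` reads as single code points.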
