Java 7, regexes and supplementary unicode characters

Question

The string in question contains a supplementary Unicode character, "\ud84c\udfb4" (U+233B4). According to the Javadoc, regex matching should be done at the code point level, not the char level. However, the split code below treats the low surrogate (\udfb4) as a non-word character and splits on it.

Am I missing something? What are other alternatives to accomplish splitting on non-word characters? (Java version "1.7.0_07")

Thanks in advance.

Pattern non_word_regex = Pattern.compile("[\\W]", Pattern.UNICODE_CHARACTER_CLASS);
String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
String b ="功能 絶顯示廣告";
System.out.print("original "+a+"\norginal hex ");
for(char c : a.toCharArray()){
    System.out.print(Integer.toHexString((int)c));
    System.out.print(' ');
}
System.out.println();

String[] tokens = non_word_regex.split(a);

for(int i =0; i< tokens.length; i++){
   String token = tokens[i];
   System.out.print(i+" ");
   for(char c : token.toCharArray()){
       System.out.print(Integer.toHexString((int)c));
       System.out.print(' ');
   }
   System.out.println();
}

Output:
original 功能絶顯示廣告
orginal hex 529f 80fd 20 7d76 d84c dfb4 986f 793a 5ee3 544a
0 529f 80fd
1 7d76 d84c
2 986f 793a 5ee3 544a

Malcolm · Accepted Answer · 2013-12-10T20:45:56.727

This simply looks like a bug in the regex engine. If you use the \w expression, everything matches correctly: the supplementary character remains a single code point composed of two chars. This can be easily verified by running the following code:

Pattern pattern = Pattern.compile("(?U)[\\w]");
String str = "功能 絶\ud84c\udfb4顯示廣告"; // \ud84c\udfb4 is the supplementary character from the question

Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.toMatchResult().group());
}

I've just made a thorough investigation, so I can tell you where the problem is. If you look at the method compile() in java.util.regex.Pattern (starting at line 1625), you will see the code that scans the regex for supplementary characters and decides whether or not to support them in scanning.

The problem with this approach is that it only looks at the regex: even if the regex itself contains no supplementary characters, it may still need to match them in the input, which is exactly what happens in your case.
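As a rough illustration of the kind of check described above (a sketch only, not the actual JDK source; the class and the helper name containsSupplementary are made up here), the following inspects the regex text alone for code points above U+FFFF. The pattern from the question contains none, even though the input string does.

```java
class SupplementaryScanSketch {
    // Hypothetical helper, not JDK code: reports whether the given text
    // contains at least one code point outside the Basic Multilingual Plane.
    static boolean containsSupplementary(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String regex = "[\\W]";                // pattern from the question
        String input = "\u7d76\ud84c\udfb4";   // part of the input, includes U+233B4
        System.out.println("regex has supplementary: " + containsSupplementary(regex));
        System.out.println("input has supplementary: " + containsSupplementary(input));
    }
}
```

A decision based only on the first result cannot know about the second, which is the gap this answer describes.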

The solution is to devise a regex that contains supplementary characters which don't affect the matching process. I suggest you use something innocuous like this:

Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");

The part (?!\uDB80\uDC00) does the trick. This is a negative lookahead for a character in the private use range of supplementary characters (U+F0000), which means that you most likely won't find it in the text. And voilà: the regex engine thinks that there are supplementary characters in the pattern, and turns on their support!
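For completeness, here is a minimal sketch of the workaround applied to the string from the question. The class and method names (SplitWorkaroundDemo, splitOnNonWord) are only illustrative; the pattern and the input are the ones shown above. Tokens are printed per code point rather than per char, so the surrogate pair shows up as one value.

```java
import java.util.regex.Pattern;

class SplitWorkaroundDemo {
    // Splits on non-word characters using the pattern with the dummy
    // lookahead for a private use supplementary code point.
    static String[] splitOnNonWord(String text) {
        Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");
        return nonWordRegex.split(text);
    }

    public static void main(String[] args) {
        String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
        String[] tokens = splitOnNonWord(a);
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder hex = new StringBuilder();
            // Walk the token by code point so a surrogate pair is one entry.
            for (int j = 0; j < tokens[i].length(); ) {
                int cp = tokens[i].codePointAt(j);
                hex.append(Integer.toHexString(cp)).append(' ');
                j += Character.charCount(cp);
            }
            System.out.println(i + " " + hex.toString().trim());
        }
    }
}
```

With the workaround in place, the split should happen only at the space, leaving two tokens with the pair d84c dfb4 kept together as 233b4 inside the second one.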

Unfortunately, Matcher does not preserve word boundaries. Using "[^\\w]" predictably gives the same result as "[\\W]". Should I post it on the Java boards somewhere? — user3088039, Dec 10 '13 at 20:36
@user3088039 I've just resolved the issue! Check the answer again, I've updated it. — Malcolm, Dec 10 '13 at 20:46
You'd think "(?U)" would turn on supplementary character support. Thanks for looking under the covers. It works beautifully. — user3088039, Dec 10 '13 at 20:54
