I am trying to find out why this Java regex, ([\ud800-\udbff\udc00-\udfff]), used in replaceAll(regexp, ""), also removes the hyphen-minus character along with the surrogate characters.

The hyphen-minus is \u002d, so it does not appear to fall inside either of those ranges.

I could easily suppress this behaviour by adding &&[^\u002d], resulting in ([\ud800-\udbff\udc00-\udfff&&[^\u002d]])

But since I do not know why \u002d is removed, I suspect other characters could be removed unnoticed as well.

Example:

String text = "A\u002dB";
System.out.println(text);
String regex = "([\ud800-\udbff\udc00-\udfff])";
System.out.println(text.replaceAll(regex, "X"));

prints:

A-B
AXB
Cœur
Issus
  • Can you post a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve)? – Tim Pietzcker Jan 07 '15 at 13:53
  • I am not sure the ranges work for unicode (\u) escaped characters. Are these valid (single UTF-16) characters? – Gábor Bakos Jan 07 '15 at 13:59
  • Does escaping the backslashes help? `String regex = "([\\ud800-\\udbff\\udc00-\\udfff])";` – Tim Pietzcker Jan 07 '15 at 14:04
  • I was adding an example, but Pshemo kindly did so on my behalf, thanks! I think Unicode ranges work, since they work quite well for the surrogates and they appear in many tutorials. Of course, in Java code the regex string is double-escaped; sorry if that was not clear enough – Issus Jan 07 '15 at 14:18
  • @TimPietzcker you get exactly the same effect with single and double backslashes. – geert3 Jan 07 '15 at 14:22
  • It looks like `\udbff\udc00` is treated as a surrogate pair, which means it represents one character. In other words, your regex becomes something like `[a-c-d]`, which is the same as `[a-c]|-|d`; that is why `-` is also accepted. (I am not sure I am entirely right here; in fact I am almost certain I missed some important fact, or even a bug, which is why I posted this as a comment.) Anyway, a workaround for your problem would be to wrap each range in `[..]`, like `([[\ud800-\udbff][\udc00-\udfff]])` – Pshemo Jan 07 '15 at 14:50
  • Pshemo, that could be the case. Actually, if I use "([\\ud800-\\udbff[\\udc00-\\udfff]])" the - is not removed. What is a bit weird is that instead of the surrogates I get ?. I'll think about this detail; it could be that it is now removing only one of the chars in the pair instead of both, due to the surrogate effect you mentioned (it was removing both when they were inside the range, as - was, but now only the low surrogate is?). – Issus Jan 07 '15 at 15:11
  • So, Pshemo, if you add this as a response, I would mark it as a valid response, if you do not mind. Is there anything else I should do to set this as answered or something? – Issus Jan 07 '15 at 15:16
  • @Pshemo: Java regex has treated them as a surrogate pair since Java 6, IIRC. I need to refer to the source code for earlier versions. And wrapping them like that won't work, since Pattern matches by code point internally, at least in Oracle's implementation. – nhahtdh Jan 07 '15 at 15:33
  • @nhahtdh And there goes +1 for you. I knew my suspicion was lacking so I didn't post it as answer. I really need to read more about surrogate pairs someday. – Pshemo Jan 07 '15 at 17:06

2 Answers

Overview and assumption

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

Behind the scenes

Assuming that you are running your regex on Oracle's implementation, your regex

"([\ud800-\udbff\udc00-\udfff])"

is compiled as such:

StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
  Pattern.union. A ∪ B:
    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
    BitClass. Match any of these 1 character(s):
      [U+002D]
  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match

The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00.
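The arithmetic behind that parse can be checked with a minimal sketch like the one below. The class and method names are made up for illustration; only `Character.toCodePoint` and `String.replaceAll` are standard library calls. It combines the two surrogates that sit next to each other in the class and then applies the regex from the question to the sample text.

```java
class SurrogatePairParsing {
    // Combine a high and a low surrogate the way UTF-16 decoding does.
    static int pairToCodePoint(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    // Apply the character class from the question to the given text.
    static String replaceWithX(String text) {
        return text.replaceAll("([\ud800-\udbff\udc00-\udfff])", "X");
    }

    public static void main(String[] args) {
        int cp = pairToCodePoint('\udbff', '\udc00');
        // Prints U+10FC00, the upper bound of the first range in the dump above.
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        // The question reports AXB here: the hyphen is matched as a literal.
        System.out.println(replaceWithX("A-B"));
    }
}
```

The second range in the question never exists as a range: its intended start is consumed as the low half of the pair, which leaves a literal hyphen and a single U+DFFF.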

Wrong solution

There is no point in writing:

"[\ud800-\udbff][\udc00-\udfff]"

Since Oracle's implementation matches by code point, and a valid surrogate pair is converted to a code point before matching, the regex above can't match anything: it searches for two consecutive lone surrogates that would together form a valid pair, and such a pair is never seen as two separate surrogates.
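A small sketch of that claim, assuming an Oracle/OpenJDK `java.util.regex` implementation (other implementations may behave differently). The class name and helper are hypothetical; U+1F600 is just an arbitrary astral code point chosen for the demonstration.

```java
import java.util.regex.Pattern;

class SplitClassesNeverMatch {
    // A high-surrogate class followed by a low-surrogate class.
    static final Pattern SPLIT = Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    static boolean matchesAnywhere(String text) {
        return SPLIT.matcher(text).find();
    }

    public static void main(String[] args) {
        // One astral code point, stored as two UTF-16 chars.
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(astral.length());          // 2
        // Expected to be false: the pair is read as one code point, U+1F600,
        // which lies in neither surrogate range.
        System.out.println(matchesAnywhere(astral));
    }
}
```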

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

This solution has been tested to work in Java 6 and 7 (Oracle implementation).
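As a quick check of the expression above, here is a minimal sketch with made-up sample input (the class name and the test string are illustrative, not from the original post). The input mixes a hyphen, one astral character and one lone surrogate.

```java
class StripAstralAndSurrogates {
    // Same character class as above, written with string-literal Unicode escapes.
    static String strip(String input) {
        return input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
    }

    public static void main(String[] args) {
        String astral = new String(Character.toChars(0x1F600)); // valid surrogate pair
        String sample = "A-B" + astral + "C" + "\ud800" + "D";  // plus one lone high surrogate
        // Expected result is A-BCD: the hyphen is kept, while the astral
        // character and the lone surrogate are both removed.
        System.out.println(strip(sample));
    }
}
```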

The regex above compiles to:

StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match

Note that I am specifying the characters with string literal Unicode escape sequences, not with escape sequences in regex syntax.

// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")

Java 6 doesn't recognize surrogate pairs when they are specified in regex syntax, so the regex engine treats \\ud800 as one character and tries to compile the range \\udc00-\\udbff, which fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.


Since Java 7, the syntax \x{h..h} is available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in the astral planes.

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

This regex also compiles to the same structure as above.
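A sketch of the \x{h..h} form with a precompiled pattern, assuming Java 7 or later. The class name and sample input are hypothetical; U+1D11E is an arbitrary astral code point.

```java
import java.util.regex.Pattern;

class HexEscapeForm {
    // Astral range in regex hex notation; surrogate range as string-literal escapes.
    static final Pattern ASTRAL_OR_SURROGATE =
            Pattern.compile("[\\x{10000}-\\x{10ffff}\ud800-\udfff]");

    static String strip(String input) {
        return ASTRAL_OR_SURROGATE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String astral = new String(Character.toChars(0x1D11E)); // valid surrogate pair
        String sample = "x-y" + astral + "\udfff" + "z";        // plus one lone low surrogate
        // Expected result is x-yz, the same behaviour as the surrogate-pair form.
        System.out.println(strip(sample));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every `replaceAll` call, which matters if the cleanup runs over many strings.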

nhahtdh

If you make the range

[\ud800-\udfff]

or

[\ud800-\udbff\udbff-\udfff]

it will leave the hyphen untouched. Seems like a bug to me.

Note that there is no reason for the double range: in your example, \udc00 is just the next code point after \udbff, so you could skip it. If you make the two ranges overlap by one or more code points, it works again, but you could just as well leave the second range out (see my first example above).

geert3
  • I already tested that one :-). To show you the "problem" with it, here is an example with some surrogates: String a1= "text hypen - and ". With your approach the hyphen is not removed, but instead of the surrogates you get ?, exactly "Text - and ? ? ? ? ? ? ? " in this example. – Issus Jan 07 '15 at 14:28
  • Are you sure you are representing these surrogate pairs with the specified range? (Because I am not ;-) Surrogates fall in \u2b100000 to \u2b10ffff. You can paste them in your range and it will work.) – geert3 Jan 07 '15 at 14:40
  • Note that this doesn't mean that there isn't something suspicious going on in the original question's range: hyphen shouldn't be replaced there. – geert3 Jan 07 '15 at 14:43
  • Yes, I think that the surrogates becoming ? does not mean it fails, but just the opposite :-) But I think the explanation is the one in Pshemo's comment above (in the question comments). So your response is useful, but it does not answer the question (the question was why ;-) ). Thank you very much for your responses. – Issus Jan 07 '15 at 15:15