Using Java regexes to match a range of Unicode code points _outside_ the BMP: is it possible at all?

Question

While totally unrelated at first, this question made me wonder...

Java's regexes are based on Strings; Strings are sequences (arrays) of chars; and chars are ultimately UTF-16 code units.

The latter means that a single char can represent any Unicode code point inside the BMP, i.e. from U+0000 to U+FFFF.

Outside the BMP, however, two chars are required for a single code point (one for the leading surrogate, another for the trailing surrogate). Short of a dedicated grammar engine, I don't see a way for Java regexes (as defined by java.util.regex.Pattern) to define "character classes" for such code points, since there is no String literal for code points outside the BMP.
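To make the surrogate-pair point concrete, here is a minimal sketch using only standard `java.lang.Character` and `String` methods. The class name `SurrogateDemo` and its helper methods are made up for illustration; U+1F601 is simply the code point used later in this thread.

```java
final class SurrogateDemo {
    /** Number of UTF-16 code units (chars) needed to encode the code point. */
    static int charCount(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** Builds a String that holds exactly one code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F601);

        // Two chars (a leading and a trailing surrogate), but one code point.
        System.out.println(s.length());                              // 2
        System.out.println(s.codePointCount(0, s.length()));         // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true

        // The same String can be written as a literal with two \\u escapes.
        System.out.println(s.equals("\uD83D\uDE01"));                // true
    }
}
```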

Notwithstanding that code can be written to produce regexes (well, string literals used as regexes) for such ranges, is there an existing, possibly undocumented, mechanism in Pattern that allows this?

Yes, if you have Java 7 or up, there's a `\x` option (look at the Javadoc for Pattern). Otherwise, you need surrogate pairs. Someone asked this a few days ago - I'll see if I can find it. — Dawood ibn Kareem, Nov 12 '14 at 22:26
@DavidWallace uhwell, I have read the doc and missed that... Make it an answer! — fge, Nov 12 '14 at 22:27
Have a look at the excellent answer (not my one) here. http://stackoverflow.com/a/26838867. Your question isn't quite the same, but the answer is the same. Should I close as a duplicate? — Dawood ibn Kareem, Nov 12 '14 at 22:28
@DavidWallace indeed, an excellent answer; in fact I didn't even suspect that specifying a leading/trailing surrogate pair in a range would succeed! — fge, Nov 12 '14 at 22:32

score 7 · Answer 1 · edited May 23 '17 at 10:26


OK, so, answer to self; data extracted from this question and the associated answer which @DavidWallace pointed to.

It is possible. To paraphrase the answer, in such a character class as:

"[\uD83D\uDE01-\uD83D\uDE4F]"

the Java regex engine will be smart enough to notice that you specify a surrogate pair on both ends of the interval, and "compile" the regex accordingly.

In addition, starting with Java 7, you can also use \x{foo} where foo is the hexadecimal representation of the code point. Not forgetting the quoting necessary in Java string literals, the above can therefore be written:

"[\\x{1F601}-\\x{1F64F}]"
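As a quick check of the two spellings above, the following sketch compiles both character classes and tests a few code points against them. The class name `AstralRangeDemo` and the helpers `cp` and `matchesBoth` are hypothetical names chosen for this example; the `\x{...}` form assumes Java 7 or later.

```java
import java.util.regex.Pattern;

final class AstralRangeDemo {
    // Range written with surrogate pairs on both ends of the interval.
    static final Pattern SURROGATE_FORM =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Same range written with the \x{h...h} code point syntax (Java 7+).
    static final Pattern HEX_FORM =
            Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    /** Builds a String that holds exactly one code point. */
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** True if the single code point fully matches both patterns. */
    static boolean matchesBoth(int codePoint) {
        String s = cp(codePoint);
        return SURROGATE_FORM.matcher(s).matches()
                && HEX_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesBoth(0x1F601)); // lower bound of the range
        System.out.println(matchesBoth(0x1F64F)); // upper bound of the range
        System.out.println(matchesBoth(0x1F600)); // just below the range
        System.out.println(matchesBoth(0x1F650)); // just above the range
    }
}
```

Using `matches()` rather than `find()` means the whole two-char input has to be consumed by the class, so a match on only half of a surrogate pair would not count.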

answered Nov 12 '14 at 22:55

fge

`smart enough to notice` To be precise, in Oracle's implementation, the engine converts the String into an array of `int` internally before it is compiled proper. Therefore, it simply sees `[codepoint]-[codepoint]` when compiling the character class, like the case with BMP. – nhahtdh Nov 13 '14 at 02:36
@nhahtdh meh, blind copy and paste, didn't pay attention... Thanks for noticing! – fge Nov 13 '14 at 02:54
Note for future readers: The Java regex engine isn't smart enough to handle this correctly if you negate the character class. – flodin Mar 09 '21 at 09:24
@flodin But it does work with the `\\x{1F601}` syntax. – SiXoS Mar 31 '22 at 09:24
