While totally unrelated at first, this question made me wonder...
Java's regexes are based on Strings; Strings are sequences (arrays) of chars; and chars are ultimately UTF-16 code units.
The latter means that a single char can match any Unicode code point inside the BMP, i.e. from U+0000 to U+FFFF.
Outside the BMP, however, two chars are required for a single code point (one for the leading surrogate, another for the trailing surrogate). Apart from a dedicated grammar engine, I don't see a way for Java regexes (as defined by java.util.regex.Pattern) to define "character classes" for such code points, since, as far as I can tell, no single char or \u escape in a String literal can represent a code point outside the BMP.
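To make the two-char point concrete, here is a minimal sketch using only java.lang.Character and String. The class name SurrogateDemo, the helper charsNeeded and the two sample code points are my own picks for illustration.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units (chars) needed to encode one code point.
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // U+00E9, inside the BMP
        int supplementary = 0x1F600; // U+1F600, outside the BMP

        System.out.println(charsNeeded(bmp));           // 1
        System.out.println(charsNeeded(supplementary)); // 2

        // Build the String from the code point itself.
        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length());                             // 2 chars
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true (leading)
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true (trailing)
    }
}
```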
Notwithstanding that code can be written to produce regexes (well, string literals used as regexes) for such ranges, is there an existing, undocumented mechanism in Pattern that allows one to do that?
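One way to probe this is a small test like the one below. It is only a sketch, and it assumes Java 7 or later, where, if I read the Pattern Javadoc correctly, a \x{h...h} hexadecimal code point escape is available. The class name SupplementaryClassProbe, the helper fullMatch and the range U+1F600 to U+1F64F are illustrative choices of mine.

```java
import java.util.regex.Pattern;

public class SupplementaryClassProbe {
    // Illustrative helper: does the whole input match the given regex?
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Character class for U+1F600..U+1F64F, written with \x{h...h}
        // (assumption: Java 7+; the backslashes are doubled for the Java literal).
        String regex = "[\\x{1F600}-\\x{1F64F}]";

        String inside = new String(Character.toChars(0x1F60A));  // within the range
        String outside = new String(Character.toChars(0x1F680)); // above the range

        // If Pattern treats the class as a range of code points, these
        // should print true, false and false respectively.
        System.out.println(fullMatch(regex, inside));
        System.out.println(fullMatch(regex, outside));
        System.out.println(fullMatch(regex, "a"));
    }
}
```

If the first call returns true, the two-char input was consumed as a single code point by a single character class, which is the behaviour I am asking about.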