How can I match characters (with the intention of removing them) from outside the Unicode Basic Multilingual Plane in Java?
2 Answers
26
To remove all non-BMP characters, the following should work:
String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
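As a minimal sketch of the one-liner in use (the class name `NonBmpRegexDemo` and the helper `stripNonBmp` are made up for illustration), the following wraps the same pattern in a method. It relies on `java.util.regex` matching by code point, so a valid surrogate pair is treated as one character above U+FFFF:

```java
class NonBmpRegexDemo {
    // Same pattern as above. The regex engine compares whole code points,
    // so a well-formed surrogate pair falls outside the 0000-FFFF range.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as the code units D83D DE00.
        String input = "caf\u00E9 \uD83D\uDE00 ok";
        // The accented letter is in the BMP and stays; the emoji is removed.
        System.out.println(stripNonBmp(input));
    }
}
```

BMP characters, including non-ASCII ones such as U+00E9, pass through unchanged.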

James Van Huis
- Have you actually tested this? Because your character range includes the surrogate range used to construct non-BMP codepoints. – Anon Oct 27 '10 at 17:32
- @Anon: As you pointed out in your own answer, regexps are evaluated at the level of codepoints, not codeunits, so it doesn't see surrogates. – axtavt Oct 27 '10 at 17:35
- Yes, this has been tested with non-BMP characters. – James Van Huis Oct 27 '10 at 17:39
- @axtavt - actually, I assumed that regex was evaluated at the character level, and that the non-BMP codepoint was simply translated into surrogates. – Anon Oct 27 '10 at 17:42
- @Anon - More info on supplementary characters in Java: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ – James Van Huis Oct 27 '10 at 17:44
- WARNING: this substitution may introduce new astral characters by pairing previously unpaired surrogates, which may or may not be acceptable for the original question: try `String inputString = "\uD800\uD800\uDC00\uDC00";`. – Feb 15 '16 at 12:15
- Side note #1: for the `\uD800\uD800\uDC00\uDC00` example, @Anon's `StringBuilder` solution produces exactly the same output as the regex solution. Side note #2: applying this filtering twice (or more) may be required to get rid of non-BMP chars completely. – Nikita Bosik Jan 06 '18 at 16:52
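The re-pairing effect described in the last two comments can be reproduced with a short sketch. The class name `SurrogateRepairingDemo` and the helper `stripUntilStable` are hypothetical, and repeating the substitution until the string stops changing is just one possible reading of "applying this filtering twice (or more)":

```java
class SurrogateRepairingDemo {
    // The substitution from the answer, applied once.
    static String stripOnce(String s) {
        return s.replaceAll("[^\u0000-\uFFFF]", "");
    }

    // Hypothetical helper: repeat the substitution until nothing changes.
    // Every pass that changes the string shortens it, so the loop ends.
    static String stripUntilStable(String s) {
        String previous;
        do {
            previous = s;
            s = stripOnce(s);
        } while (!s.equals(previous));
        return s;
    }

    public static void main(String[] args) {
        // Lone high surrogate, a valid pair (U+10000), lone low surrogate.
        String input = "\uD800\uD800\uDC00\uDC00";
        String once = stripOnce(input);
        // Removing the middle pair leaves the two lone surrogates adjacent,
        // and together they form a new U+10000.
        System.out.println(once.length());                             // 2
        System.out.println(Integer.toHexString(once.codePointAt(0)));  // 10000
        System.out.println(stripUntilStable(input).isEmpty());         // true
    }
}
```

Whether deleting the newly formed pair is the right outcome depends on how the application wants to treat malformed input in the first place.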
4
Are you looking for specific characters or all characters outside the BMP?
If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected:
String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
Matcher matcher = regex.matcher(test);
matcher.find();
System.out.println(matcher.start());
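Since the question is about removing characters, here is a rough sketch of how the same idea might be used to find and delete one specific supplementary code point. The class and method names are invented for illustration; `Pattern.quote` is added only so that the helper also behaves when the code point happens to be a regex metacharacter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecificCodePointDemo {
    // Hypothetical helper: the code point as a String (one or two chars).
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Index (in UTF-16 code units) of the first match, or -1 if absent.
    static int indexOfCodePoint(String input, int codePoint) {
        Matcher m = Pattern.compile(Pattern.quote(asString(codePoint))).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Literal (non-regex) replacement is enough to delete a known character.
    static String removeCodePoint(String input, int codePoint) {
        return input.replace(asString(codePoint), "");
    }

    public static void main(String[] args) {
        String test = "test" + asString(0x10300) + "test";
        System.out.println(indexOfCodePoint(test, 0x10300));  // 4
        System.out.println(removeCodePoint(test, 0x10300));   // testtest
    }
}
```

Checking the result of `find()` before calling `start()` avoids an `IllegalStateException` when the character is not present.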
If you're looking to remove all non-BMP characters from a string, then I'd use StringBuilder directly rather than regex:
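// Copy only the BMP code points of `test` into the result.
StringBuilder sb = new StringBuilder(test.length());
for (int ii = 0; ii < test.length(); )
{
    int codePoint = test.codePointAt(ii);
    if (codePoint > 0xFFFF)
    {
        // Supplementary code point: skip both of its UTF-16 code units.
        ii += Character.charCount(codePoint);
    }
    else
    {
        sb.appendCodePoint(codePoint);
        ii++;
    }
}
String result = sb.toString();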
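On Java 8 or later (an assumption, the answer predates it), the same filter could be sketched with `String.codePoints()`. The variant below also drops unpaired surrogates, which is a deliberate departure from the loop above: it keeps two lone surrogates from ending up adjacent and forming a new supplementary character, the case raised in the comments on the other answer. The class and method names are hypothetical:

```java
class BmpStreamFilter {
    // Keeps BMP code points only, and additionally discards lone surrogates.
    // codePoints() reports an unpaired surrogate as its own value, so it can
    // be tested with Character.isSurrogate after a narrowing cast.
    static String keepBmpScalars(String input) {
        return input.codePoints()
                .filter(cp -> Character.isBmpCodePoint(cp)
                        && !Character.isSurrogate((char) cp))
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // Well-formed input: only the emoji (U+1F600) is removed.
        System.out.println(keepBmpScalars("a\uD83D\uDE00b"));                    // ab
        // Malformed input: lone surrogates and the valid pair are all removed.
        System.out.println(keepBmpScalars("\uD800\uD800\uDC00\uDC00").isEmpty()); // true
    }
}
```

If lone surrogates must be preserved, removing the `isSurrogate` check gives the same behavior as the loop.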

Anon