Iterate over the characters and check whether each one belongs to a category you define as "standard" (here the categories are: alphabetic, digit, whitespace, or a modifier applied to a previously accepted character):
    static String standartize(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        boolean based = false; // is the previous character an accepted base for a modifier?
        int c;
        // Advance by one code point (1 or 2 chars) per iteration
        for (int i = 0; i < s.length(); i += Character.charCount(c)) {
            c = Character.codePointAt(s, i);
            if (based && Character.getType(c) == Character.MODIFIER_SYMBOL) {
                // Modifier symbol directly following an accepted character: keep it
                sb.appendCodePoint(c);
            } else if (Character.isAlphabetic(c) || Character.isDigit(c)) {
                sb.appendCodePoint(c);
                based = true;
            } else if (Character.isWhitespace(c)) {
                sb.appendCodePoint(c);
                based = false;
            } else {
                // Anything else is dropped and cannot serve as a base
                based = false;
            }
        }
        return sb.toString();
    }
You can add or remove checks in the else if chain to widen or narrow the range of characters you consider "standard": Character has many static isXxxx() methods that test whether a character belongs to a given category.
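As one illustration of widening the range, the sketch below also accepts punctuation by checking the general category returned by Character.getType(). It is a simplified variant, not the method above: the class name WidenedFilter and the helpers isPunctuation and keepStandard are hypothetical names introduced here, and the modifier handling is left out for brevity.

```java
public class WidenedFilter {

    // Hypothetical helper: true if the code point is in one of the
    // Unicode punctuation general categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Hypothetical simplified filter: keeps letters, digits, whitespace
    // and punctuation; every other code point is dropped.
    static String keepStandard(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int c;
        for (int i = 0; i < s.length(); i += Character.charCount(c)) {
            c = s.codePointAt(i);
            if (Character.isAlphabetic(c)
                    || Character.isDigit(c)
                    || Character.isWhitespace(c)
                    || isPunctuation(c)) { // the extra check that widens the range
                sb.appendCodePoint(c);
            }
        }
        return sb.toString();
    }
}
```

With this variant, keepStandard("a-b_c+d$") keeps the hyphen and the underscore but drops + and $, because those two are classified as symbols rather than punctuation.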
Note that the loop iterates over int code points, not char items. This way it handles not only characters that fit in a single UTF-16 char, but also supplementary characters encoded as surrogate pairs.
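To see why this matters, the following minimal sketch compares a per-char filter with a per-code-point filter on a string that contains a supplementary letter (U+1D538, which needs a surrogate pair). The class and method names (CodePointDemo, keepLettersByChar, keepLettersByCodePoint) are made up for this illustration.

```java
public class CodePointDemo {

    // Per-char filtering: each half of a surrogate pair is tested on its
    // own, and a lone surrogate is not a letter, so the pair is lost.
    static String keepLettersByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isLetter(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Per-code-point filtering: a surrogate pair is read as one code point
    // and tested as a whole.
    static String keepLettersByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        int c;
        for (int i = 0; i < s.length(); i += Character.charCount(c)) {
            c = s.codePointAt(i);
            if (Character.isLetter(c)) {
                sb.appendCodePoint(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a", then U+1D538 (a letter outside the Basic Multilingual Plane), then "b"
        String s = "a" + new String(Character.toChars(0x1D538)) + "b";
        System.out.println(s.length());                              // number of chars
        System.out.println(s.codePointCount(0, s.length()));         // number of code points
        System.out.println(keepLettersByChar(s).equals("ab"));       // supplementary letter dropped
        System.out.println(keepLettersByCodePoint(s).equals(s));     // supplementary letter kept
    }
}
```

The per-char version silently removes the supplementary letter, while the per-code-point version keeps all three letters.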