Remove non-standard characters from a String in Java

Question

I have a string that contains a lot of text. It includes some weird characters, such as the following: █ ✖ ✔ ♫ ♬ ▬ ★

This is just a small portion of what I have found so far. I tried using the replaceAll method but it doesn't seem to work. Is there a collection of all these types of characters somewhere, or even better yet, a library that is able to remove them?

You should specify the sample input string, the expected output string, and your code. — TheLostMind, Mar 11 '16 at 14:06
Please define what you mean by "non-standard". These seem to be pretty standard Unicode characters to me. — biziclop, Mar 11 '16 at 14:06
"it doesn't seem to work" is not a problem description. What did you try to do, and how *precisely* did it not work? — Raedwald, Mar 11 '16 at 14:07
Define "weird". Do you think that they are text that someone intended to create and you just don't happen to want, or are they the consequence of text corruption? — Tom Blodget, Mar 11 '16 at 18:12

Alex Salauyou · Answer 1 · 2016-03-11T20:47:28.983

Iterate over the string's characters and check whether each one belongs to a category you define as "standard". Here those categories are: alphabetic, digit, whitespace, or a modifier applied to a previously accepted character:

static String standardize(String s) {
    if (s == null) return null;
    StringBuilder sb = new StringBuilder();
    boolean based = false;    // was the previous code point an accepted base for a modifier?
    int c;
    // Advance by code point (1 or 2 chars) so a surrogate pair is read as one unit.
    for (int i = 0; i < s.length(); i += Character.charCount(c)) {
        c = Character.codePointAt(s, i);
        // MODIFIER_SYMBOL is general category Sk (e.g. '^', '`'); combining marks
        // (Mn/Mc/Me) are separate categories and are not matched by this check.
        if (based && Character.getType(c) == Character.MODIFIER_SYMBOL) {
            sb.appendCodePoint(c);
        } else if (Character.isAlphabetic(c) || Character.isDigit(c)) {
            sb.appendCodePoint(c);
            based = true;
        } else if (Character.isWhitespace(c)) {
            sb.appendCodePoint(c);
            based = false;
        } else {
            based = false;    // anything else is dropped
        }
    }
    return sb.toString();
}

You can add or remove checks in the else if chain to widen or narrow the range of characters you consider "standard": Character has many static isXxx() methods that test whether a character belongs to a given category.
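As one illustration of widening the range, here is a minimal sketch that also keeps punctuation and treats Unicode combining marks as modifiers of the preceding accepted character. The class name WidenedFilter and its helper methods are hypothetical, not part of the answer, and which categories count as "standard" is an assumption you would adjust to your own data.

```java
// Hypothetical variant for illustration; class and method names are assumptions.
final class WidenedFilter {

    // General categories treated here as marks that attach to a preceding base.
    static boolean isMark(int cp) {
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.MODIFIER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    // Punctuation categories this sketch chooses to keep.
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    static String filter(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        boolean based = false; // did we just keep a letter or digit?
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (based && isMark(cp)) {
                sb.appendCodePoint(cp);      // mark following an accepted base
            } else if (Character.isAlphabetic(cp) || Character.isDigit(cp)) {
                sb.appendCodePoint(cp);
                based = true;
            } else if (Character.isWhitespace(cp) || isPunctuation(cp)) {
                sb.appendCodePoint(cp);
                based = false;
            } else {
                based = false;               // symbols such as U+2605 are dropped
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2605 is a black star, U+2714 a heavy check mark.
        System.out.println(filter("Hi, \u2605world\u2714!")); // prints "Hi, world!"
    }
}
```

Keeping each category test in its own small method makes it easy to switch a group on or off without touching the loop.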

Note that the loop iterates over int code points, not char values. This lets it handle characters encoded as surrogate pairs (two UTF-16 code units), not only those that fit in a single char.
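To see why the distinction matters, the following sketch compares the number of char units with the number of code points for a string containing a supplementary character. The class name CodePointDemo and the choice of U+1F600 as the sample character are assumptions made for illustration.

```java
// Hypothetical demo class; the name and sample character are illustrative only.
final class CodePointDemo {

    // Collects code points by stepping with Character.charCount,
    // so a surrogate pair yields a single entry.
    static int[] codePointsOf(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs two UTF-16 code units.
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(s.length());              // prints 3 (char units)
        System.out.println(codePointsOf(s).length);  // prints 2 (code points)
    }
}
```

A loop over s.charAt(i) would see the two halves of the pair separately, and neither half on its own belongs to a useful category.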

This won't work with Unicode codepoints that are encoded in two UTF-16 code units (`char`). Nor will it retain "combining character" codepoints that might not be considered "weird" when combined with a preceding base character. — Tom Blodget, Mar 11 '16 at 18:16
@TomBlodget well, you're right. I updated the answer to handle surrogate pairs and modifier characters as well. Thank you for valuable notice. — Alex Salauyou, Mar 11 '16 at 20:33

score 0 · Answer 2 · answered Mar 11 '16 at 14:20

If you want only ASCII characters in your string, you can loop over the string and check whether each character's ASCII value is between 65 and 90 (A-Z), 97 and 122 (a-z), or 48 and 57 (0-9).
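A minimal sketch of that loop might look like the following. The class and method names are hypothetical, and note that under these ranges whitespace and punctuation are removed as well, which may or may not be what you want.

```java
// Hypothetical helper for illustration; names are not from the answer.
final class AsciiFilter {

    // Keeps only A-Z (65-90), a-z (97-122) and 0-9 (48-57); drops everything else.
    static String keepAsciiAlphanumerics(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            boolean upper = ch >= 'A' && ch <= 'Z';
            boolean lower = ch >= 'a' && ch <= 'z';
            boolean digit = ch >= '0' && ch <= '9';
            if (upper || lower || digit) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiAlphanumerics("Ab1 \u2605z9!")); // prints "Ab1z9"
    }
}
```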

Sachin

If you are not sure about the question, you should not try to answer it. You should leave a comment (as you can see) asking for clarification. – TheLostMind Mar 11 '16 at 14:25
