-2

I have a string that contains a lot of text. There are some weird characters in it, such as the following: █ ✖ ✔ ♫ ♬ ▬ ★

This is just a small portion of what I have found so far. I tried using the replaceAll method, but it doesn't seem to work. Is there a collection of all these types of characters somewhere or, better yet, a library that can remove them?

RoboticR
  • 3
    You should specify the sample input string, the expected output string, and your code. – TheLostMind Mar 11 '16 at 14:06
  • 6
    Please define what you mean by "non-standard". These seem to be pretty standard Unicode characters to me. – biziclop Mar 11 '16 at 14:06
  • 4
    "it doesn't seem to work" is not a problem description. What did you try to do, and how *precisely* did it not work? – Raedwald Mar 11 '16 at 14:07
  • 1
    Define "weird". Do you think that they are text that someone intended to create and you just don't happen to want, or are they the consequence of text corruption? – Tom Blodget Mar 11 '16 at 18:12

2 Answers

2

Iterate over the characters and check whether each one belongs to a category you define as "standard" (here the categories are: alphabetic, digit, whitespace, or a modifier applied to a previously accepted character):

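static String standardize(String s) {
    if (s == null) return null;
    StringBuilder sb = new StringBuilder();
    boolean based = false;    // was the previous code point an accepted base for a modifier?
    int c;
    // Advance by code point, not by char, so a surrogate pair is read as one unit.
    for (int i = 0; i < s.length(); i += Character.charCount(c)) {
        c = Character.codePointAt(s, i);
        int type = Character.getType(c);
        // Combining marks (Mn, Mc, Me) are checked in addition to MODIFIER_SYMBOL,
        // because accents such as U+0301 belong to those categories.
        boolean modifier = type == Character.MODIFIER_SYMBOL
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        if (based && modifier) {
            sb.appendCodePoint(c);    // modifier on an accepted base: keep it
        } else if (Character.isAlphabetic(c) || Character.isDigit(c)) {
            sb.appendCodePoint(c);
            based = true;
        } else if (Character.isWhitespace(c)) {
            sb.appendCodePoint(c);
            based = false;
        } else {
            based = false;    // rejected: drop it and any modifiers that follow
        }
    }
    return sb.toString();
}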

You can add or remove checks in the else if branches to widen or narrow the range of characters you consider "standard": Character has many static isXxxx() methods that test whether a character belongs to a given category.
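
To illustrate widening or narrowing the accepted range, here is a minimal sketch that passes the test in as a predicate. The class name CodePointFilterSketch, the keep helper, and the two predicates are hypothetical names made up for this example, and the sketch leaves out the modifier tracking for brevity.

```java
import java.util.function.IntPredicate;

class CodePointFilterSketch {

    // Hypothetical helper: keeps only the code points the predicate accepts.
    static String keep(String s, IntPredicate accepted) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().filter(accepted).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Narrow definition of "standard": letters, digits and whitespace.
    static final IntPredicate LETTER_DIGIT_SPACE =
            c -> Character.isAlphabetic(c) || Character.isDigit(c) || Character.isWhitespace(c);

    // True for code points in any of the Unicode punctuation categories.
    static boolean isPunctuation(int c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Wider definition: the narrow one plus punctuation.
    static final IntPredicate WITH_PUNCTUATION =
            LETTER_DIGIT_SPACE.or(CodePointFilterSketch::isPunctuation);
}
```

With the input "Hi, ★ you!", the narrow predicate drops the comma, the star and the exclamation mark, while the wider one drops only the star.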

Note that the loop iterates over int code points rather than char values. This lets it handle not only characters that fit in a single UTF-16 char, but also characters encoded as surrogate pairs.
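
As a small illustration of why this matters, the sketch below counts letters in a string first char by char and then code point by code point. The class and method names are hypothetical; U+20000 is used as the sample because it is a letter outside the Basic Multilingual Plane and therefore takes two char values.

```java
class SurrogatePairSketch {

    // "a" followed by U+20000, which UTF-16 stores as a surrogate pair.
    static String sample() {
        return "a" + new String(Character.toChars(0x20000));
    }

    // Tests each UTF-16 code unit separately; the two halves of a
    // surrogate pair are not letters on their own.
    static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) n++;
        }
        return n;
    }

    // Tests whole code points, so the pair is seen as one letter.
    static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            if (Character.isLetter(c)) n++;
            i += Character.charCount(c);
        }
        return n;
    }
}
```

For the sample string, the char-based count finds one letter and the code-point-based count finds two, even though length() reports 3.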

Alex Salauyou
  • 1
    This won't work with Unicode codepoints that are encoded in two UTF-16 code units (`char`). Nor will it retain "combining character" codepoints that might not be considered "weird" when combined with a preceding base character. – Tom Blodget Mar 11 '16 at 18:16
  • @TomBlodget well, you're right. I updated the answer to handle surrogate pairs and modifier characters as well. Thank you for the valuable note. – Alex Salauyou Mar 11 '16 at 20:33
0

If you want only ASCII letters and digits in your string, you can loop over the string and check whether each character's ASCII value is between 65 and 90 (A-Z), between 97 and 122 (a-z), or between 48 and 57 (0-9).
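
A minimal sketch of that loop might look like the following. The class name AsciiFilterSketch and the method keepAsciiAlnum are hypothetical, and the sketch assumes that everything outside the three ranges, including spaces and punctuation, should be dropped.

```java
class AsciiFilterSketch {

    // Keeps only A-Z, a-z and 0-9; every other char is dropped.
    static String keepAsciiAlnum(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            boolean upper = ch >= 65 && ch <= 90;    // A-Z
            boolean lower = ch >= 97 && ch <= 122;   // a-z
            boolean digit = ch >= 48 && ch <= 57;    // 0-9
            if (upper || lower || digit) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

Iterating by char is enough here because every ASCII character fits in a single char, and surrogate halves fall outside all three ranges. Note that accented letters such as é are removed as well, which may or may not be what you want.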

Sachin
  • 3
    If you are not sure about the question, you should not try to answer it. You should leave a comment (as you can see) asking for clarification. – TheLostMind Mar 11 '16 at 14:25