I want to remove all non-printable characters and all emoji from my String.

I tried the following, but it doesn't handle emoji properly:

    public static String removeAllNoAsciiChars(String str) {
        if (!TextUtils.isEmpty(str)) {
            str = str.replaceAll("\\p{C}", "");
        }
        return str;
    }

Examples:

"L'alphabet est génial !"

Final result expected: "L'alphabet est génial !"

"Ça c'est du cœur ❤️ :) !"

Final result expected: "Ça c'est du cœur :) !"

Micha Wiedenmann
anthony
  • It is maybe better to specify what you want to keep. Btw. the method is badly named because you keep much more than just ASCII. – Henry Jan 02 '18 at 09:08
  • The emoji part: [*What is the regex to extract all the emojis from a string?*](https://stackoverflow.com/questions/24840667/what-is-the-regex-to-extract-all-the-emojis-from-a-string/44056668#44056668) Did you do a *thorough* search? This is the second hit for ["\[java\] remove emoji"](/search?q=%5Bjava%5D+remove+emoji). *(not my dv)* – T.J. Crowder Jan 02 '18 at 09:08

1 Answer

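The \\p{C} regex takes care of all non-printable characters, i.e. Unicode category "Other" (control, format, surrogate, private-use, and unassigned code points). Be aware that this includes tabs and newlines.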
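To make the tab/newline caveat concrete, here is a minimal sketch. The class and method names (`ControlCharDemo`, `stripOther`) are made up for illustration and are not part of the question's code.

```java
class ControlCharDemo {
    // Hypothetical helper: removes every character matched by \p{C}.
    static String stripOther(String s) {
        return s.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args) {
        // Tab and newline are control characters, so \p{C} removes them too.
        System.out.println(stripOther("a\tb\nc")); // prints "abc"
    }
}
```

If you need to keep line breaks, you would have to exclude them from the match or replace them with a space first.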

As for Emoji characters, that's a bit more complicated. You could just match the newer Emoji characters in Unicode, i.e. Unicode Block 'Emoticons' (U+1F600 to U+1F64F), but that's not really all the Emoji characters, e.g. ❤ 'HEAVY BLACK HEART' (U+2764) is not in that range.

If you look at those Emoji characters, e.g. 'GRINNING FACE' (U+1F600), you'll see that it belongs to Unicode Category "Symbol, Other [So]", which consists of 5855 characters. If you're ok removing all those, that would definitely be the easiest solution.
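The block and category claims can be checked with the standard `Character` API. This is only a sketch: the class and helper names are hypothetical, and it assumes a runtime whose Unicode tables include the 'Emoticons' block (Java 7 or later).

```java
class EmojiCategoryDemo {
    // Hypothetical helper: true if the code point lies in the 'Emoticons' block.
    static boolean inEmoticonsBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.EMOTICONS;
    }

    // Hypothetical helper: true if the code point has category "Symbol, Other [So]".
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        System.out.println(inEmoticonsBlock(0x1F600)); // GRINNING FACE: true
        System.out.println(inEmoticonsBlock(0x2764));  // HEAVY BLACK HEART: false
        System.out.println(isOtherSymbol(0x1F600));    // true
        System.out.println(isOtherSymbol(0x2764));     // true
    }
}
```

Both characters share the category even though only one is in the 'Emoticons' block, which is why matching on \\p{So} catches the heart as well.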

Your text included a red heart (❤️), not a black heart (❤). Unicode produces it by adding a variation selector after the black heart, in this case 'VARIATION SELECTOR-16' (U+FE0F). There are 256 variation selectors, all in category "Mark, Nonspacing [Mn]", but you probably don't want to remove all 1763 characters in that category, so remove just the 2 ranges of variation selectors, i.e. U+FE00 to U+FE0F (selectors 1-16) and U+E0100 to U+E01EF (selectors 17-256).
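A short sketch of what the red heart looks like at the code point level. The class and the `isVariationSelector` helper are illustrative names, not a standard API; the string is written with Unicode escapes so the example does not depend on source file encoding.

```java
class VariationSelectorDemo {
    // Hypothetical helper: true for the two variation selector ranges named above.
    static boolean isVariationSelector(int cp) {
        return (cp >= 0xFE00 && cp <= 0xFE0F)        // selectors 1-16
            || (cp >= 0xE0100 && cp <= 0xE01EF);     // selectors 17-256
    }

    public static void main(String[] args) {
        String redHeart = "\u2764\uFE0F"; // HEAVY BLACK HEART + VARIATION SELECTOR-16
        // Walk the string one code point at a time.
        for (int i = 0; i < redHeart.length(); ) {
            int cp = redHeart.codePointAt(i);
            System.out.println(String.format("U+%04X", cp)
                    + " variation selector: " + isVariationSelector(cp));
            i += Character.charCount(cp);
        }
    }
}
```

Removing only the heart would leave the invisible U+FE0F behind in the string, which is why the selector ranges need to be part of the pattern.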

After that, you may or may not want to reduce consecutive spaces to a single space.

str = str.replaceAll("[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]+", "")
         .replaceAll(" {2,}", " ");
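For reference, here is one way the snippet above could be wrapped into a self-contained method and run against the two examples from the question. The class name `TextCleaner`, the method name `clean`, and the precompiled `Pattern` constants are assumptions made for this sketch. The plain null/empty check stands in for Android's `TextUtils.isEmpty` so the code runs on any JVM.

```java
import java.util.regex.Pattern;

class TextCleaner {
    // Same character class as above: category C, category So, and both variation selector ranges.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]+");
    private static final Pattern MULTIPLE_SPACES = Pattern.compile(" {2,}");

    // Hypothetical wrapper around the two replaceAll calls.
    static String clean(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        String stripped = UNWANTED.matcher(str).replaceAll("");
        // Removing a symbol between two spaces leaves a double space; collapse it.
        return MULTIPLE_SPACES.matcher(stripped).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source file encoding.
        System.out.println(clean("L'alphabet est g\u00E9nial !"));
        System.out.println(clean("\u00C7a c'est du c\u0153ur \u2764\uFE0F :) !"));
    }
}
```

Precompiling the patterns only avoids recompiling the regex on every call; the behavior is the same as chaining `replaceAll` on the string.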
Andreas
  • @YCF_L The `\s` regex is same as `[ \t\n\x0B\f\r]`, but only the space still exists, because the 5 control characters got removed by the `\p{C}` match. I even warned about that in the first paragraph. Since `\s` is then reduced to just being the same as a space, I'm reverting your update. – Andreas Jan 02 '18 at 13:12
  • oops, my bad, I'm sorry, nice answer :) – Youcef LAIDANI Jan 02 '18 at 13:14