How to remove all non-printable characters + Emoji from a string?

Question

I want to remove all non-printable characters and all Emoji from my String.

I tried the following, but it doesn't handle Emoji properly:

public static String removeAllNoAsciiChars(String str) {
    if (!TextUtils.isEmpty(str)) {
        str = str.replaceAll("\\p{C}", "");
    }
    return str;
}

Examples:

"L'alphabet est génial !"

Final result expected: "L'alphabet est génial !"

"Ça c'est du cœur ❤️ :) !"

Final result expected: "Ça c'est du cœur :) !"

It is maybe better to specify what you want to keep. Btw. the method is badly named because you keep much more than just ASCII. — Henry, Jan 02 '18 at 09:08
The emoji part: [*What is the regex to extract all the emojis from a string?*](https://stackoverflow.com/questions/24840667/what-is-the-regex-to-extract-all-the-emojis-from-a-string/44056668#44056668) Did you do a *thorough* search? This is the second hit for ["\[java\] remove emoji"](/search?q=%5Bjava%5D+remove+emoji). *(not my dv)* — T.J. Crowder, Jan 02 '18 at 09:08

Andreas · Answer 1 · 2018-01-02T13:12:41.910

The \\p{C} regex takes care of all non-printable characters. Be aware that this includes tabs and newlines.
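To make the warning about tabs and newlines concrete, here is a minimal sketch. The class and method names are hypothetical and chosen only for illustration.

```java
final class ControlCharDemo {
    // Removes every code point in Unicode category C
    // (control, format, surrogate, private use, unassigned).
    static String stripCategoryC(String s) {
        return s.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args) {
        // Tab and newline are control characters, so the words run together.
        System.out.println(stripCategoryC("one\ttwo\nthree")); // onetwothree
        // A plain space is not in category C and is kept.
        System.out.println(stripCategoryC("one two"));         // one two
    }
}
```

If you need to keep line breaks, handle them before applying this replacement.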

As for Emoji characters, that's a bit more complicated. You could just match the newer Emoji characters in Unicode, i.e. Unicode Block 'Emoticons' (U+1F600 to U+1F64F), but that doesn't cover all the Emoji characters, e.g. ❤ 'HEAVY BLACK HEART' (U+2764) is not in that range.
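A small sketch of why matching only the 'Emoticons' block falls short. The class and method names are hypothetical, and the code assumes Java 7 or later for the `\x{...}` code point syntax.

```java
final class EmoticonsBlockDemo {
    // Removes only code points in the 'Emoticons' block, U+1F600 to U+1F64F.
    static String stripEmoticonsBlock(String s) {
        return s.replaceAll("[\\x{1F600}-\\x{1F64F}]", "");
    }

    public static void main(String[] args) {
        String grinning = new String(Character.toChars(0x1F600)); // GRINNING FACE
        String heart = "\u2764";                                   // HEAVY BLACK HEART

        // The grinning face is inside the block and gets removed.
        System.out.println(stripEmoticonsBlock("a" + grinning + "b").equals("ab")); // true
        // The heart is outside the block and survives.
        System.out.println(stripEmoticonsBlock("a" + heart + "b").equals("ab"));    // false
    }
}
```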

If you look at those Emoji characters, e.g. 'GRINNING FACE' (U+1F600), you'll see that it belongs to Unicode Category "Symbol, Other [So]", which consists of 5855 characters. If you're ok removing all those, that would definitely be the easiest solution.
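You can check the category of a code point yourself with `Character.getType`. The sketch below uses hypothetical class and method names and assumes Java 8 or later, since older JDKs ship an older Unicode database in which U+1F600 is still unassigned. It also shows the cost of the "easiest solution": symbols such as © (U+00A9) are in category So as well.

```java
final class CategoryDemo {
    // True if the code point is in Unicode category "Symbol, Other [So]".
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    // Removes everything in category So, Emoji or not.
    static String stripOtherSymbols(String s) {
        return s.replaceAll("\\p{So}", "");
    }

    public static void main(String[] args) {
        System.out.println(isOtherSymbol(0x1F600)); // GRINNING FACE: true on Java 8+
        System.out.println(isOtherSymbol(0x2764));  // HEAVY BLACK HEART: true
        System.out.println(isOtherSymbol('A'));     // a letter: false
        // The copyright sign is also "Symbol, Other", so it is removed too.
        System.out.println(stripOtherSymbols("\u00A9 2018").equals(" 2018")); // true
    }
}
```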

Your text included a red heart (❤️), not a black heart (❤). Unicode encodes that by adding a variation selector after the black heart, in this case 'VARIATION SELECTOR-16' (U+FE0F). There are 256 variation selectors, all in category "Mark, Nonspacing [Mn]", but you probably don't want to remove all 1763 characters in that category, so remove just the 2 ranges of variation selectors, i.e. U+FE00 to U+FE0F (selectors 1-16) and U+E0100 to U+E01EF (selectors 17-256).
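The following sketch illustrates why only the two selector ranges are targeted rather than all of `\p{Mn}`: combining accents are nonspacing marks too. The class and method names are hypothetical.

```java
final class VariationSelectorDemo {
    // Removes only variation selectors 1-16 and 17-256.
    static String stripVariationSelectors(String s) {
        return s.replaceAll("[\\x{FE00}-\\x{FE0F}\\x{E0100}-\\x{E01EF}]", "");
    }

    public static void main(String[] args) {
        String redHeart = "\u2764\uFE0F"; // black heart + VARIATION SELECTOR-16
        String accentedE = "e\u0301";     // 'e' + COMBINING ACUTE ACCENT (also Mn)

        // The selector is a nonspacing mark, not a symbol.
        System.out.println(Character.getType(0xFE0F) == Character.NON_SPACING_MARK); // true
        // Only the selector is removed; the heart itself stays.
        System.out.println(stripVariationSelectors(redHeart).equals("\u2764"));      // true
        // The targeted ranges leave the accent alone...
        System.out.println(stripVariationSelectors(accentedE).equals(accentedE));    // true
        // ...whereas removing all of Mn would strip it.
        System.out.println(accentedE.replaceAll("\\p{Mn}", "").equals("e"));         // true
    }
}
```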

After that, you may or may not want to reduce consecutive spaces to a single space.

str = str.replaceAll("[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]+", "")
         .replaceAll(" {2,}", " ");
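For completeness, here is a sketch of the same replacement wrapped in a method and run against the two examples from the question. The class name `TextCleaner` and method name `clean` are hypothetical. The null/empty check uses plain Java instead of Android's `TextUtils`, so the snippet runs anywhere, and the non-ASCII characters are written as Unicode escapes to avoid source-encoding issues.

```java
final class TextCleaner {
    // Category C, category So, and both variation-selector ranges.
    private static final String UNWANTED =
            "[\\p{C}\\p{So}\\x{FE00}-\\x{FE0F}\\x{E0100}-\\x{E01EF}]+";

    static String clean(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        return str.replaceAll(UNWANTED, "")  // drop unwanted code points
                  .replaceAll(" {2,}", " "); // collapse the gaps left behind
    }

    public static void main(String[] args) {
        // "L'alphabet est génial !" contains nothing to remove.
        String plain = "L'alphabet est g\u00E9nial !";
        // "Ça c'est du cœur ❤️ :) !" with the red heart as U+2764 U+FE0F.
        String withHeart = "\u00C7a c'est du c\u0153ur \u2764\uFE0F :) !";

        System.out.println(clean(plain).equals(plain));                                 // true
        System.out.println(clean(withHeart).equals("\u00C7a c'est du c\u0153ur :) !")); // true
    }
}
```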

@YCF_L The `\s` regex is the same as `[ \t\n\x0B\f\r]`, but only the space still exists at that point, because the 5 control characters were already removed by the `\p{C}` match. I even warned about that in the first paragraph. Since `\s` is then reduced to matching just a space, I'm reverting your update. — Andreas, Jan 02 '18 at 13:12
