210

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English; some are in other languages, including ones written in non-Latin scripts. For example:

▓ railway??
→ Cats and dogs
I'm on 
Apples ⚛ 
✅ Vi sign
♛ I'm the king ♛ 
Corée ♦ du Nord ☁  (French)
 gjør at både ◄╗ (Norwegian)
Star me ★
Star ⭐ once more
早上好 ♛ (Chinese)
Καλημέρα ✂ (Greek)
another ✓ sign ✓
добрай раніцы ✪ (Belarus)
◄ शुभ प्रभात ◄ (Hindi)
✪ ✰ ❈ ❧ Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.❉

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser fails to remove the majority of the signs. So far, ♦ is the only sign I have found that it removes. Other signs such as ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?

riorio
  • 95
    what you want to keep? – Youcef LAIDANI Mar 27 '18 at 10:07
  • 31
    Two problems: What is EmojiParser? Doesn't seem to be part of a standard library, so this mention is not very helpful. And what characters exactly do you want to filter? You say "many more of this kind", but there are many character groups and families. We need to know more about your criteria. – Markus Fischer Mar 27 '18 at 10:08
  • 3
    You'll need to identify the character ranges you want to keep (or the ones you want to remove), perhaps with the help of the various utilities here: http://unicode.org/cldr/utility/ You'll need to handle the fact that Java strings aren't strings of code points, they're strings of UTF-16 code units and so a single character (code point) may be encoded as two Java `char`s (called a "surrogate pair"). There are similar questions here on site about doing that which deal with those issues. – T.J. Crowder Mar 27 '18 at 10:10
  • 1
    @DavidFoerster - here are some of the icons that EmojiParser is not removing: ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ . It did remove the following: ♦. The only one I found till now that it removed. – riorio Mar 27 '18 at 12:22
  • 7
    Note that all of your symbols above are not emojis in the [official list](http://unicode.org/Public/emoji/11.0/emoji-sequences.txt) except ✂ black scissors 0x2702: [✪ circled white star 0x272A, ❉ balloon-spoked asterisk 0x2749, ★ black star 0x2605, ✰ shadowed white star 0x2730, ❈ heavy sparkle 0x2748, ❧ rotated floral heart bullet 0x2767, ❋ heavy eight teardrop-spoked propeller asterisk 0x274B, ⓡ circled latin small letter r 0x24E1, ✿ black florette 0x273F, ♛ black chess queen 0x265B](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks) – phuclv Mar 27 '18 at 13:23
  • 6
    @LưuVĩnhPhúc: Of the 7 in the q, four are classed as emojis by Unicode.org (see "Emoji: Yes"): [🔥 U+1F525](http://unicode.org/cldr/utility/character.jsp?a=1F525), [⚛ U+269B](http://unicode.org/cldr/utility/character.jsp?a=269B), [✅ U+2705](http://unicode.org/cldr/utility/character.jsp?a=2705), and [⭐ U+2B50](http://unicode.org/cldr/utility/character.jsp?a=2B50). Three are not: [→ U+2192](http://unicode.org/cldr/utility/character.jsp?a=2192), [♛ U+265B](http://unicode.org/cldr/utility/character.jsp?a=265B), and [★ U+2605](http://unicode.org/cldr/utility/character.jsp?a=2605). – T.J. Crowder Mar 27 '18 at 13:58
  • 1
    @T.J.Crowder yes most of the ones in the question are emojis but I'm specifically talking about the OP's comment – phuclv Mar 27 '18 at 14:43
  • 1
    @LưuVĩnhPhúc: Ah! Indeed, only two of them (not none) he said weren't removed in [the comment](https://stackoverflow.com/questions/49510006/remove-and-other-such-signs-from-java-string?noredirect=1#comment86030011_49510006) are emojis: [🔥 U+1F525](http://unicode.org/cldr/utility/character.jsp?a=1F525) and [✂ U+2702](http://unicode.org/cldr/utility/character.jsp?a=2702). – T.J. Crowder Mar 27 '18 at 15:44
  • 4
    What about [Combining characters](https://en.wikipedia.org/wiki/Combining_character) and [control characters](https://en.wikipedia.org/wiki/Unicode_control_characters) what should happen with them? – Oleg Mar 27 '18 at 18:00
  • 129
    IDK what your motivations behind this are, but if it's to filter text input: don't. I'm tired of being forced to use a-zA-Z. Let me write in my native language, or emojis, or whatever I want. Do I really want my calendar appointment to be called "‍♂️"? Yes, yes I do. Now get out of my way. – Alexander Mar 27 '18 at 18:14
  • 19
    Please clarify what exactly you want to keep and remove. On the surface the question appears to be clear but because of the complexity of Unicode it is not and because of that it's impossible to provide a good answer. – Oleg Mar 27 '18 at 18:16
  • 12
    this seems like a strange thing to want to do when it destroys the meaning of at least one of your examples? – Eevee Mar 27 '18 at 22:49
  • Please understand we get some nasty users that don't stop using lots of cucumber and water droplet emojis. So we should remove some specific combinations to make sure our community doesn't turn into a very unpleasant place. – Ezequiel Adrian Sep 17 '22 at 21:00

7 Answers

321

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");

So:

  • [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class matching all letter (\\p{L}), mark (\\p{M}), numeric (\\p{N}), punctuation (\\p{P}), separator (\\p{Z}), format (\\p{Cf}), and surrogate (\\p{Cs}) characters, plus whitespace such as tab and newline (\\s). \\p{L} covers letters from every script, such as Latin, Cyrillic, and Kanji.
  • The ^ in the regex character set negates the match.

Example:

String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
// Output:
//   "hello world _# 皆さん、こんにちは! 私はジョンと申します。"

If you need more information, check out the Java documentation for regexes.
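As a minimal sketch, the whitelist above could be packaged for reuse roughly as follows. The class name `SymbolStripper` and the method `strip` are hypothetical and not part of the answer. The sketch compiles the same pattern once and relies on `java.util.regex` matching whole code points, so a surrogate-pair emoji is tested as a single character:

```java
import java.util.regex.Pattern;

class SymbolStripper {
    // Same character class as in the answer, compiled once so repeated calls
    // do not recompile it.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: removes every character outside the whitelist.
    public static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on source encoding.
        System.out.println(strip("Star me \u2605"));          // black star, U+2605
        System.out.println(strip("\u65e9\u4e0a\u597d \u265b")); // Chinese text plus chess queen, U+265B
        System.out.println(strip("I'm on \uD83D\uDD25"));     // fire emoji, U+1F525, as a surrogate pair
    }
}
```

Removing a symbol leaves the surrounding spaces in place, so callers may want to trim or collapse whitespace afterwards.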

Nick Bull
  • 4
    The obvious gap between ASCII alphanumeric characters and emoji is accented and non-Latin letters. Without the OP's input on these we don't know whether this is a good answer (not my DV though) – Chris H Mar 27 '18 at 15:55
  • 4
    Yeah I'm curious as to why this would possibly get downvoted. The second I saw this question, a regular expression was the absolute first thing that came to mind (P.S. since he's looking for standard characters and punctuation, I'd use something like `[^\w\^\-\[\]\.!@#$%&*\(\)/+'":;~?,]` but that's just me being robust and trying collect all typical characters that aren't symbols). Upvoted because this is definitely a potential solution. If he wants to add some other language characters, he can add them to the expression as necessary. – Chris Mar 27 '18 at 16:05
  • 15
    @Chris great punctuation regex example, looks extensive enough to me for some cases. Also maybe people aren't reading the whole answer then - **as stated at the bottom of the answer, `p{L}` handles non-English alphabetical characters**. I hope it's understood that I can't list extensively through every non-English alphabet in my answer as that would be impractically verbose. – Nick Bull Mar 27 '18 at 16:09
  • To keep all letters, numbers, punctuation, and spaces, you probably want something like `"[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}]"`. Or, the converse, `"[\\p{S}\\p{C}]"`. – VGR Mar 27 '18 at 17:01
  • 1
    What about punctuation? Also a small typo, missing "]" in the regex. – Oleg Mar 27 '18 at 20:08
  • 13
    This. Please and thank you. Don't try to *forbid* characters that cause you problems; decide what characters you *allow* and encode that. Then your code has a clearly defined set of test cases. – jpmc26 Mar 27 '18 at 20:57
  • @Acccumulation Good catch, I was at work typing that up really quick and forgot to add `\s` after `\w`. You can capture newlines as well by adding `\r\n`. – Chris Mar 27 '18 at 22:07
  • @Chris the character class `\p{Punct}` can also be used for punctuation. Sorry guys, answering the question from my phone so I'm potentially a little less vigilant than earlier with edits – Nick Bull Mar 27 '18 at 22:45
  • @NickBull Is this java based alongside the p{L} class you mentioned before? I haven’t worked with Java too much (.NET and C++ Developer mainly and I’m _almost_ positive I haven’t seen those yet but that may just be me overlooking them). – Chris Mar 27 '18 at 22:46
  • 1
    @Chris it is, I think C# has \p{P} for a punctuation match. I know for POSIX there's the equivalent [:punct:] as well for shell scripts – Nick Bull Mar 27 '18 at 23:43
  • 1
    The `^` inside character set negates the whole set in any regex flavour. I think you meant _The `^` in **character-set** negates the match._ (instead of Java) – hjpotter92 Mar 28 '18 at 04:31
  • This turns `皆さん、こんにちは! 私はジョンと申します。` into `皆さんこんにちは私はジョンと申します`. @VGR's works correctly for Japanese, but I don't know if it works correctly for removing emoticons. – ProgrammingLlama Mar 28 '18 at 07:23
  • 1
    Actually, as zwol points out in a comment on another answer, `\\p{Cf}` should also be preserved, as format characters are part of valid text. – VGR Mar 28 '18 at 11:41
  • 1
    @VGR I've updated with a better set of Regex character categories that work for that example. Please feel free to edit w improvements – Nick Bull Mar 28 '18 at 12:54
  • 1
    Without `\p{M}`, combining marks will be removed from the text and may leave the text hard to read or outright non-sense in languages which use accent. – nhahtdh Mar 28 '18 at 14:55
  • 2
    I suggest `"[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\s]"`. This allows the general categories Letter, Mark, Number, Punctuation, Separator, and "Other, Format", as well as whitespace characters such as tab and newline. – Sean Van Gorder Mar 28 '18 at 17:15
  • @SeanVanGorder I've accepted your proposal, minus the `\\s` as `\\p{Z}` represents all whitespace and separator characters already. – Nick Bull Mar 28 '18 at 17:24
  • @NickBull No, control characters like tab and newline are in the category "Cc - Other, Control". There's no single category for whitespace. – Sean Van Gorder Mar 28 '18 at 17:29
  • 1
    @NickBull One more addition, `\\p{Cs}` "Other, Surrogate" seems to be necessary for code points above `U+FFFF`. That would make it `"[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]"`. – Sean Van Gorder Mar 28 '18 at 17:43
  • @SeanVanGorder great stuff, added. – Nick Bull Mar 28 '18 at 17:48
  • 2
    To preserve all the punctuation and currency symbols, too, I use: `"[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\p{Sc}\\p{Punct}\\s]"` – Mike Sickler Jan 31 '19 at 03:09
  • @MikeSickler Is there a difference between `p{P}` and `p{Punct}`? I couldn't find any good explanations on Google (but I didn't try very hard either :p) – Nick Bull Jan 31 '19 at 08:57
  • Adding ```\\p{M}``` to the whitelist still left unwanted wandering marks in my strings. Using ```Normalizer.normalize(aString, Form.NFC)``` before the whitelist solved it for me. It's hard to be 100% sure, but at least for diacritics found in Romance languages (such as "ãéíóúãâäàçñ") it seems to work. – GFonte Nov 09 '21 at 14:52
  • @NickBull @SeanVanGorder `\\p{Cs}` needs to be removed from above regex, otherwise emoji like ✨ that consist of a surrogate pair will not be fully removed. Proof: https://regex101.com/r/Ap0VrZ/1 – edwardmp Apr 14 '23 at 19:07
83

I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls "the general category" of each character. There are several letter and punctuation categories.

You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:

COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
PARAGRAPH_SEPARATOR
SPACE_SEPARATOR
START_PUNCTUATION
TITLECASE_LETTER
UPPERCASE_LETTER

(All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
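A minimal sketch of this category whitelist, assuming hypothetical names `CategoryFilter`, `isKept`, and `filter` that are not from the answer. It also lets tab, LF, and CR through explicitly, since those are category CONTROL and would otherwise be dropped (a point raised in the comments on this answer):

```java
class CategoryFilter {
    // General categories to keep, copied from the list above.
    private static final byte[] KEPT_TYPES = {
        Character.COMBINING_SPACING_MARK, Character.CONNECTOR_PUNCTUATION,
        Character.CURRENCY_SYMBOL, Character.DASH_PUNCTUATION,
        Character.DECIMAL_DIGIT_NUMBER, Character.ENCLOSING_MARK,
        Character.END_PUNCTUATION, Character.FINAL_QUOTE_PUNCTUATION,
        Character.FORMAT, Character.INITIAL_QUOTE_PUNCTUATION,
        Character.LETTER_NUMBER, Character.LINE_SEPARATOR,
        Character.LOWERCASE_LETTER, Character.MATH_SYMBOL,
        Character.MODIFIER_LETTER, Character.MODIFIER_SYMBOL,
        Character.NON_SPACING_MARK, Character.OTHER_LETTER,
        Character.OTHER_NUMBER, Character.OTHER_PUNCTUATION,
        Character.PARAGRAPH_SEPARATOR, Character.SPACE_SEPARATOR,
        Character.START_PUNCTUATION, Character.TITLECASE_LETTER,
        Character.UPPERCASE_LETTER
    };

    // Lookup table indexed by the value Character.getType returns (0..30).
    private static final boolean[] KEEP = new boolean[32];
    static {
        for (byte type : KEPT_TYPES) {
            KEEP[type] = true;
        }
    }

    // Hypothetical helper: true if the code point should stay in the text.
    public static boolean isKept(int codePoint) {
        // Tab, LF and CR are category CONTROL, so they are allowed explicitly.
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return true;
        }
        return KEEP[Character.getType(codePoint)];
    }

    // Walks the string by code point so surrogate pairs are judged as one character.
    public static String filter(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().filter(CategoryFilter::isKept).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("Star me \u2605"));      // black star, U+2605
        System.out.println(filter("I'm on \uD83D\uDD25")); // fire emoji, U+1F525
    }
}
```

Because MATH_SYMBOL is on the list, characters such as → survive this filter; drop that entry if arrows should go too.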

Daniel Wagner
  • 1
    FORMAT (Cf) should be preserved also; this includes the clustering and directional overrides, without which it is impossible to write certain (unusual, admittedly) words in some languages. – zwol Mar 27 '18 at 18:39
  • @zwol Thanks for the details! I'll add it to the list. – Daniel Wagner Mar 27 '18 at 18:41
  • 30
    This is the future-proof answer. Regardless of future updates to the Unicode standard, including/excluding characters based on their categories means that individual parsing of characters and the maintenance of a list is unnecessary. Of course, cursory testing of text in different languages (e.g. Chinese, Arabic etc.) should be done to ensure that the filtered categories match the text required to be allowed in the target environment. – CJBS Mar 27 '18 at 20:32
  • 3
    Oh, another gotcha I should have thought of yesterday: TAB, CR, and LF are all general category Cc (Java's CONTROL). Those need to be specially whitelisted, since you almost certainly _don't_ want to allow most of the legacy control characters. – zwol Mar 28 '18 at 14:29
  • @CJBS The problem with this approach is that it has only been partially implemented in Java. For example, `Character.getType()` won't tell you whether your `char` (or `int` code point since the method is overloaded) is, say, an emoticon, or a musical symbol, or an emoji character, etc. If you have a simple use case it might be fine to go down this path - it's certainly an elegant approach that is easy to comprehend - but be aware that it might break if requirements change. – skomisa Dec 24 '19 at 06:36
49

Based on Full Emoji List, v11.0 you have 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.

Once you have the full list of emojis, you need to filter them out by code point. Iterating over single chars or bytes won't work, because a single code point can span multiple bytes. Because Java uses UTF-16, emojis will usually take two chars.

String input = "ab✅cd";
for (int i = 0; i < input.length();) {
  int cp = input.codePointAt(i);
  // filter out if matches
  i += Character.charCount(cp); 
}

Mapping from Unicode code point U+2705 to Java int is straightforward:

int viSign = 0x2705;

or since Java supports Unicode Strings:

int viSign = "✅".codePointAt(0);
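Putting the pieces together, a minimal sketch of the filter might look like the following. The class `EmojiBlacklist`, the method `remove`, and the four-entry set are hypothetical stand-ins. A real list would be generated from the Unicode emoji data files rather than typed by hand:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class EmojiBlacklist {
    // Tiny illustrative subset of the emoji list: U+2705, U+2B50, U+269B, U+1F525.
    private static final Set<Integer> EMOJI =
            new HashSet<>(Arrays.asList(0x2705, 0x2B50, 0x269B, 0x1F525));

    // Hypothetical helper: copies every code point that is not on the blacklist.
    public static String remove(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!EMOJI.contains(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(remove("ab\u2705cd"));          // check mark, U+2705
        System.out.println(remove("I'm on \uD83D\uDD25")); // fire emoji, U+1F525
    }
}
```

This removes single code points only. Emoji built from several code points (for example sequences joined by a zero-width joiner) would leave fragments behind unless those joiners and modifiers are handled as well.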
Karol Dowbecki
  • 29
    Very useful list. Interesting that something called EmojiParser with a method called removeAllEmojis fails to handle these... :-) – T.J. Crowder Mar 27 '18 at 10:20
  • 4
    `codePointAt` presumably is taking linear time, so the whole loop would have quadratic complexity? – Bergi Mar 27 '18 at 11:29
  • @Bergi you can stream codepoints using `String.chars()`, that's basically the same. `int[] codepoints = input.chars().filter(this::isAcceptableCodepoint).toArray();return new String(codepoints, 0 , codepoints.length);` – Olivier Grégoire Mar 27 '18 at 11:32
  • 7
    @Bergi: No, since `input.codePointAt` only looks at up to 2 characters at most which is a constant upper bound. Also (the newly added) `i += Character.charCount(cp)` skips over all characters that `input.codePointAt` inspected (minus 1 in some corner cases). – David Foerster Mar 27 '18 at 11:36
  • @DavidFoerster Ah, I see, it's `codePointAtCharIndex` not `nthCodePoint` :-) – Bergi Mar 27 '18 at 11:37
  • 6
    @OlivierGrégoire: `String.chars()` streams over characters not codepoints. There's a separate method [`String.codePoints()`](https://docs.oracle.com/javase/10/docs/api/java/lang/String.html#codePoints()) for that. – David Foerster Mar 27 '18 at 11:37
  • @Bergi (not to pile on but) The key point is that codePointAt takes the UTF-16 code unit offset so that's direct access into the char sequence, which thankfully is an array. – Tom Blodget Mar 27 '18 at 11:37
  • @DavidFoerster My bad, but the idea stays the same with your correction in mind. – Olivier Grégoire Mar 27 '18 at 11:37
  • 5
    There are at least two problems here: you are using a "closed" list of emojis, so each year you have to extend it (but this probably isn't easily solvable), and this code probably won't work correctly with code point sequences (see for example https://unicode.org/Public/emoji/11.0/emoji-zwj-sequences.txt) – xanatos Mar 27 '18 at 11:43
  • 49
    This is basically the same approach as used by EmojiParser and it will soon fail for the same reason. New emojis are relatively frequently added to the Unicode character database and if you are now implementing a solution using the currently defined 1644 emojis for a negative rule set, the implementation will fail as soon as new emojis become available. – jarnbjo Mar 27 '18 at 12:04
  • 3
    @T.J.Crowder because [they're not emojis (according to Unicode standard)](https://stackoverflow.com/questions/49510006/remove-and-other-such-signs-from-java-string#comment86032854_49510006) – phuclv Mar 27 '18 at 13:25
  • @LưuVĩnhPhúc: [Some aren't, more than half of them are](https://stackoverflow.com/questions/49510006/remove-and-other-such-signs-from-java-string#comment86034611_49510006). But if the OP only tested with ones that weren't, fair enough the lib didn't pick them up... – T.J. Crowder Mar 27 '18 at 13:58
  • 2
    This answer seems fairly far from a working solution at the moment. It's not trivial to gather all the emojis into an appropriate format. – Bernhard Barker Mar 27 '18 at 19:07
  • 5
    This will litter your strings with orphaned zero-width joiners. – Randy the Dev Mar 28 '18 at 04:23
21

ICU4J is your friend.

UCharacter.hasBinaryProperty(codePoint, UProperty.EMOJI); // codePoint is the int code point to test

Remember to keep your version of icu4j up to date, and note that this will only filter out official Unicode emoji, not other symbol characters. Combine it with filtering out other character types as desired.

More information: http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#EMOJI

Daniel F
  • 1
    Until Java is updated to include Emoji binary property, I guess this would be a good solution. The library needs to be updated often for the newly added codepoints, though. – nhahtdh Mar 28 '18 at 15:41
10

I gave some examples below, and thought that Latin is enough, but...

Is there a way to remove all these signs from the input string and keeping only the letters & punctuation in the different languages?

After the question was edited, I developed a new solution using the Character.getType method, which appears to be the best shot at this.

package zmarcos.emoji;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TestEmoji {

    public static void main(String[] args) {
        String[] arr = {"Remove ✅, , ✈ , ♛ and other such signs from Java string",
            "→ Cats and dogs",
            "I'm on ",
            "Apples ⚛ ",
            "✅ Vi sign",
            "♛ I'm the king ♛ ",
            "Star me ★",
            "Star ⭐ once more",
            "早上好 ♛",
            "Καλημέρα ✂"};
        System.out.println("---only letters and spaces alike---\n");
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> Character.isLetter(cp) || Character.isWhitespace(cp)).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        System.out.println("\n---unicode blocks white---\n");
        Set<Character.UnicodeBlock> whiteList = new HashSet<>();
        whiteList.add(Character.UnicodeBlock.BASIC_LATIN);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> whiteList.contains(Character.UnicodeBlock.of(cp))).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        System.out.println("\n---unicode blocks black---\n");
        Set<Character.UnicodeBlock> blackList = new HashSet<>();        
        blackList.add(Character.UnicodeBlock.EMOTICONS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_TECHNICAL);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_ARROWS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS);
        blackList.add(Character.UnicodeBlock.ALCHEMICAL_SYMBOLS);
        blackList.add(Character.UnicodeBlock.TRANSPORT_AND_MAP_SYMBOLS);
        blackList.add(Character.UnicodeBlock.GEOMETRIC_SHAPES);
        blackList.add(Character.UnicodeBlock.DINGBATS);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> !blackList.contains(Character.UnicodeBlock.of(cp))).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }
        System.out.println("\n---category---\n");
        int[] category = {Character.COMBINING_SPACING_MARK, Character.CONNECTOR_PUNCTUATION, /*Character.CONTROL,*/ Character.CURRENCY_SYMBOL,
            Character.DASH_PUNCTUATION, Character.DECIMAL_DIGIT_NUMBER, Character.ENCLOSING_MARK, Character.END_PUNCTUATION, Character.FINAL_QUOTE_PUNCTUATION,
            /*Character.FORMAT,*/ Character.INITIAL_QUOTE_PUNCTUATION, Character.LETTER_NUMBER, Character.LINE_SEPARATOR, Character.LOWERCASE_LETTER,
            /*Character.MATH_SYMBOL,*/ Character.MODIFIER_LETTER, /*Character.MODIFIER_SYMBOL,*/ Character.NON_SPACING_MARK, Character.OTHER_LETTER, Character.OTHER_NUMBER,
            Character.OTHER_PUNCTUATION, /*Character.OTHER_SYMBOL,*/ Character.PARAGRAPH_SEPARATOR, /*Character.PRIVATE_USE,*/
            Character.SPACE_SEPARATOR, Character.START_PUNCTUATION, /*Character.SURROGATE,*/ Character.TITLECASE_LETTER, /*Character.UNASSIGNED,*/ Character.UPPERCASE_LETTER};
        Arrays.sort(category);
        for (String input : arr) {
            int[] filtered = input.codePoints().filter((cp) -> Arrays.binarySearch(category, Character.getType(cp)) >= 0).toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }
    }

}

Output:

---only letters and spaces alike---

Remove ✅, , ✈ , ♛ and other such signs from Java string
Remove      and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
Im on 
Apples ⚛ 
Apples  
✅ Vi sign
 Vi sign
♛ I'm the king ♛ 
 Im the king  
Star me ★
Star me 
Star ⭐ once more
Star  once more
早上好 ♛
早上好 
Καλημέρα ✂
Καλημέρα 

---unicode blocks white---

Remove ✅, , ✈ , ♛ and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
I'm on 
Apples ⚛ 
Apples  
✅ Vi sign
 Vi sign
♛ I'm the king ♛ 
 I'm the king  
Star me ★
Star me 
Star ⭐ once more
Star  once more
早上好 ♛

Καλημέρα ✂


---unicode blocks black---

Remove ✅, , ✈ , ♛ and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
→ Cats and dogs
I'm on 
I'm on 
Apples ⚛ 
Apples  
✅ Vi sign
 Vi sign
♛ I'm the king ♛ 
 I'm the king  
Star me ★
Star me 
Star ⭐ once more
Star  once more
早上好 ♛
早上好 
Καλημέρα ✂
Καλημέρα 

---category---

Remove ✅, , ✈ , ♛ and other such signs from Java string
Remove , ,  ,  and other such signs from Java string
→ Cats and dogs
 Cats and dogs
I'm on 
I'm on 
Apples ⚛ 
Apples  
✅ Vi sign
 Vi sign
♛ I'm the king ♛ 
 I'm the king  
Star me ★
Star me 
Star ⭐ once more
Star  once more
早上好 ♛
早上好 
Καλημέρα ✂
Καλημέρα 

The code works by streaming each String as code points, using a lambda to filter them into an int array, and then converting the array back to a String.

The letters-and-spaces filter uses the Character methods isLetter and isWhitespace. It strips punctuation as well (note "Im" in the output). Failed attempt.

The Unicode-block whitelist keeps only the blocks the programmer specifies as allowed. With only BASIC_LATIN allowed, it drops the Chinese and Greek text. Failed attempt.

The Unicode-block blacklist removes the blocks the programmer specifies as not allowed. It misses symbols in blocks that are not listed, such as →. Failed attempt.

The category filter uses the static method Character.getType. The programmer defines in the category array which types are allowed. WORKS.

talonmies
Marcos Zolnowski
  • `import java.lang.Character.UnicodeBlock;`, then `Character.UnicodeBlock` -> `UnicodeBlock`. – Bernhard Barker Mar 27 '18 at 19:20
  • All your ways failed the tests. – Oleg Mar 27 '18 at 19:53
  • @Oleg no, look again, the `white list` example. – Marcos Zolnowski Mar 27 '18 at 20:01
  • Something must be wrong with my eyes or my monitor, I can't see is 早上好 and Καλημέρα – Oleg Mar 27 '18 at 20:04
  • @Oleg Maybe there is a simple way, can you give a try at it? – Marcos Zolnowski Mar 27 '18 at 21:30
  • Using character type is fine, it's either that or regex. The problem is OP never clarified his question, your current answer works with the examples he gave so I guess it's good enough. – Oleg Mar 27 '18 at 23:03
  • 4
    Note that the Java language is a little slow supporting newer Unicode versions... For example Java 10 supports only Unicode 8 (so its character classes describe only Unicode 8 characters)... So many emojis aren't present (see https://docs.oracle.com/javase/10/docs/api/java/lang/Character.html, *Character information is based on the Unicode Standard, version 8.0.0.*) – xanatos Mar 28 '18 at 11:46
1

Try this project simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Simple with:

EmojiUtils.removeEmoji(str)
coder4
-4

Use a jQuery plugin called RM-Emoji. Here's how it works:

$('#text').remove('emoji').fast()

This is the fast mode that may miss some emojis as it uses heuristic algorithms for finding emojis in text. Use the .full() method to scan entire string and remove all emojis guaranteed.

Adil B