71

I have a String encoded in UTF-8. For example:

Thats a nice joke 😆😆😆 😛

I have to extract all the emojis present in the sentence. And the emoji could be any

When this sentence is viewed in a terminal with the command less text.txt, it is displayed as:

Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B>

These are the corresponding Unicode code points for the emoji. All the codes for emojis can be found at emojitracker.

To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work for the UTF-8 encoded string.

Following is my code:

    String s="Thats a nice joke 😆😆😆 😛";
    Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for(int i=0;i<matchList.size();i++){
        System.out.println(matchList.get(i));

    }

This pdf says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So I want to capture any character lying within this range.

vishalaksh
  • 2
    That `<U+1F606>` string is specific to `less` - also, your solution idea would also capture just about any other unicode character. The only real solution would be to have a list of all unicode code points corresponding to emojis. – Drew McGowen Jul 19 '14 at 13:04
  • 1
    You'll have to find a list of all of the emoji characters (code points) you want to find, they're [spread over many different Unicode blocks](http://www.unicode.org/faq/emoji_dingbats.html#2.2). [This PDF](http://www.unicode.org/charts/PDF/U1F300.pdf) has a "good sample" (according to the first link)... – T.J. Crowder Jul 19 '14 at 13:07
  • @T.J.Crowder the pdf that you just mentioned says `Range: 1F300–1F5FF` for `Miscellaneous Symbols and Pictographs`. So lets say I want to capture any character lying within this range. Now what to do? – vishalaksh Jul 19 '14 at 13:16
  • 1
    I came here trying to find a regex that I can paste into Sublime Text to find emojis. No luck. – adib Nov 14 '16 at 01:59
  • You can use Character class http://stackoverflow.com/questions/28366172/check-if-letter-is-emoji/41147459#41147459 – user2474486 Dec 14 '16 at 16:32
  • @vishalaksh One question which comes to my mind is --"why would you require that?" I mean what use case does that help in..? thanks!! – eRaisedToX Jun 06 '17 at 12:44
  • **"String encoded in UTF-8"**: Maybe so, but that would be outside of Java's text datatypes. In Java, a string is UTF-16. Both UTF-16 and UTF-8 are encodings for the Unicode character set. UTF-8 is not relevant to the code you've shown. When you use the UTF-8 encoding in Java, you're dealing with byte[]. – Tom Blodget Dec 10 '17 at 16:40

18 Answers

53

Using emoji-java, I've written a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library but is easier to maintain than those monster regexes.

Use:

String input = "A string with a \uD83D\uDC66\uD83C\uDFFFfew emojis!";
String result = EmojiParser.removeAllEmojis(input);

emoji-java maven installation:

<dependency>
  <groupId>com.vdurmont</groupId>
  <artifactId>emoji-java</artifactId>
  <version>3.1.3</version>
</dependency>

gradle:

implementation 'com.vdurmont:emoji-java:3.1.3'

EDIT: previously submitted answer was pulled into emoji-java source code.

gidim
  • 4
    I love answers like these. This worked like a charm. Thanks! – TheKingInTheNorth Jan 19 '16 at 16:03
  • I also used this library to remove emojis and it worked perfectly. One thing, the code snippet is outdated and did not work for me with the latest version (threw some pattern exception) , in the documentation it is recommended to use `EmojiParser#removeAllEmojis(String)` and that indeed works smoothly. – Yonatan Wilkof Jun 02 '16 at 05:44
  • If you are using this. here is a link to the jar: https://github.com/vdurmont/emoji-java/releases and this is a link to the dependency: http://mvnrepository.com/artifact/org.json/json/20080701 – Whitecat Oct 12 '16 at 18:25
  • 1
    @gidim, please update the version of the dependencies to 3.1.3. Version 2.0.1 that you listed doesn't have EmojiParser.removeAllEmojis(String input) Other than that, thumbs up for the great library! – Bruno Carrier Oct 31 '16 at 20:14
  • 1
    @BrunoCarrier thanks! updated. btw i'm not author of the library. I just wrote the emoji removal function. – gidim Nov 01 '16 at 16:55
  • @gidim, unfortunately this is not removing character like (Mahjong Tile Plum). Any reasons why this happens? – azizbekian Dec 24 '19 at 09:58
38

the pdf that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So lets say I want to capture any character lying within this range. Now what to do?

Okay, but I will just note that the emoji in your question are outside that range! :-)

The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)

U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, we cross at least one boundary. So we have to know what ranges of surrogate pairs we're looking for.
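
As a quick check of those pairs, here is a minimal sketch that derives the two UTF-16 code units for a supplementary code point. The class and method names are made up for illustration, and `Character.highSurrogate` / `Character.lowSurrogate` assume Java 7 or later:

```java
public class SurrogateDemo {

    // Formats the surrogate pair of a supplementary code point
    // as two escaped UTF-16 code units.
    static String toSurrogateEscapes(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("\\u%04X\\u%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        System.out.println(toSurrogateEscapes(0x1F300)); // start of the block
        System.out.println(toSurrogateEscapes(0x1F5FF)); // end of the block
    }
}
```

The high surrogate changes between the two calls, which is the boundary crossing mentioned above.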

Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end — I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).

So armed with that knowledge, in theory we could now write a pattern:

// This is wrong, keep reading
Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");

That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.

But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:

Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
// Half of a pair --------------^------^------^-----------^------^------^

We can't just split up surrogate pairs like that, they're called surrogate pairs for a reason. :-)

Consequently, I don't think we can use regular expressions (or indeed, any string-based approach) for this at all. I think we have to search through char arrays.

char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for it the hard way:

String s = new StringBuilder()
                .append("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
char[] chars = s.toCharArray();
int index;
char ch1;
char ch2;

index = 0;
while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
    ch1 = chars[index];
    if ((int)ch1 == 0xD83C) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    else if ((int)ch1 == 0xD83D) {
        ch2 = chars[index+1];
        if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
            System.out.println("Found emoji at index " + index);
            index += 2;
            continue;
        }
    }
    ++index;
}

Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
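
For comparison, the same kind of range check can be written against code points instead of raw `char` values. This is only a sketch, assuming Java 8 or later for `String.codePoints()`. The class and method names are made up, and the ranges are just the Miscellaneous Symbols and Pictographs block discussed above plus the Emoticons block (U+1F600 to U+1F64F), not a complete emoji list:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointScan {

    // True for the two blocks used in this sketch; real emoji live in many more blocks.
    static boolean inSketchRange(int cp) {
        return (cp >= 0x1F300 && cp <= 0x1F5FF)   // Miscellaneous Symbols and Pictographs
            || (cp >= 0x1F600 && cp <= 0x1F64F);  // Emoticons
    }

    // Collects every matching code point as its own String.
    static List<String> extract(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints()
         .filter(CodePointScan::inSketchRange)
         .forEach(cp -> found.add(new String(Character.toChars(cp))));
        return found;
    }

    public static void main(String[] args) {
        String s = new StringBuilder("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
        System.out.println("Found " + extract(s).size() + " emoji");
    }
}
```

Working on code points means the surrogate arithmetic never appears in the code; the trade-off is that you still have to decide which ranges count as emoji.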


Source of my program to find out what the surrogate ranges were:

public class FindRanges {

    public static void main(String[] args) {
        char last0 = '\0';
        char last1 = '\0';
        for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
            char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
            if (chars[0] != last0) {
                if (last0 != '\0') {
                    System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                }
                System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                last0 = chars[0];
            }
            last1 = chars[1];
        }
        if (last0 != '\0') {
            System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
        }
    }
}

Output:

\uD83C \uDF00-\uDFFF
\uD83D \uDC00-\uDDFF
T.J. Crowder
  • @purrrminator: See the notes above about ranges. The above is just an example handling a specific range, but I warned the OP there were others. – T.J. Crowder Aug 11 '14 at 12:35
21

I had a similar problem. The following served me well and matches surrogate pairs:

public class SplitByUnicode {
    public static void main(String[] argv) throws Exception {
        String string = "Thats a nice joke 😆😆😆 😛";
        System.out.println("Original String:"+string);
        String regexPattern = "[\uD83C-\uDBFF\uDC00-\uDFFF]+";
        byte[] utf8 = string.getBytes("UTF-8");

        String string1 = new String(utf8, "UTF-8");

        Pattern pattern = Pattern.compile(regexPattern);
        Matcher matcher = pattern.matcher(string1);
        List<String> matchList = new ArrayList<String>();

        while (matcher.find()) {
            matchList.add(matcher.group());
        }

        for(int i=0;i<matchList.size();i++){
            System.out.println(i+":"+matchList.get(i));

        }
    }
}

Output is:


Original String:Thats a nice joke 😆😆😆 😛
0:😆😆😆
1:😛

Found the regex from https://stackoverflow.com/a/24071599/915972

Karan Ashar
  • This seems to work quite well, and it's simple too, if you take out the example Java boilerplate – r3flss ExlUtr Oct 14 '16 at 09:57
  • boilerplate code was just for completeness if any newbie to java wanted to test it :) – Karan Ashar Oct 21 '16 at 23:03
  • 1
    I tried to use `[\uD83C-\uDBFF\uDC00-\uDFFF]+` to remove emojis, and it removed the next character as well `-`. I ended up using `[\uD800\uDC00-\uDBFF\uDFFF]` – mgershen Feb 13 '18 at 13:00
19

To solve it with just a regex:

s = s.replaceAll("\\p{So}+", "");

You can read about it here:

http://www.regular-expressions.info/unicode.html

https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#OTHER_SYMBOL
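
A small sketch of what that one-liner does (the class name, method name and sample strings are mine, not from the links above). Note that `\p{So}` is the whole Unicode "Other Symbol" category, so it also strips symbols such as © that you might not think of as emoji, and it leaves surrounding whitespace untouched:

```java
public class OtherSymbolDemo {

    // Removes every run of characters in the Unicode "Other Symbol" (So) category.
    static String stripOtherSymbols(String s) {
        return s.replaceAll("\\p{So}+", "");
    }

    public static void main(String[] args) {
        // U+1F606 is built with appendCodePoint so the source file stays ASCII-only.
        String joke = new StringBuilder("joke ").appendCodePoint(0x1F606).toString();
        System.out.println("[" + stripOtherSymbols(joke) + "]");

        // The copyright sign (U+00A9) is also in category So, so it is removed too.
        System.out.println("[" + stripOtherSymbols("\u00A9 2014") + "]");
    }
}
```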



Desgard_Duan
12

This worked for me in Java 8:

public static String mysqlSafe(String input) {
    if (input == null) return null;
    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < input.length(); i++) {
        if (i < (input.length() - 1)) { // Emojis are two characters long in Java, e.g. a rocket emoji is "\uD83D\uDE80"
            if (Character.isSurrogatePair(input.charAt(i), input.charAt(i + 1))) {
                i += 1; // also skip the second character of the emoji
                continue;
            }
        }
        sb.append(input.charAt(i));
    }

    return sb.toString();
}
Mike
  • Thank you so much! Pointed me in right direction for what I needed. – HannahCarney Apr 21 '17 at 12:19
  • 1
    This logic is just simply skipping the code points outside of BMP. This may seem okay in some situations, but won't always work properly. First, this won't filter the emoji which reside in the dingbet block, and secondly, this will filter even some rare letters. – Jenix Sep 01 '18 at 19:24
9

You can do it like this:

    String s="Thats a nice joke 😆😆😆 😛";
    Pattern pattern = Pattern.compile("[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]",
                                      Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for(int i=0;i<matchList.size();i++){
        System.out.println(matchList.get(i));
    }
Shi Xiangyang
6

The best regex for extracting ALL emoji is this:

(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])

It identifies many single-char emoji that the other answers do not account for. For more information about how this regex works, take a look at this post. https://medium.com/@thekevinscott/emojis-in-javascript-f693d0eb79fb#.enomgcu63

  • I get the error *Unclosed character class near index 657* when inputting this into the `Pattern.compile()` method. – Jack Cole May 15 '19 at 18:17
6

There are two ways to solve this sticky problem.

The first is to use third-party libraries such as emoji-java and emoji4j, which are mentioned above. They offer ready-made methods such as containsEmoji or removeAllEmojis, but your own app then has to keep up with new releases of these libraries.

The second is a regex. As for me, I wanted a simple solution to this problem.

After a whole day of searching, I found a magic regex:

"(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)"

I have tested it in Java, and it solved my problem.

You can view this on the Github page:

https://github.com/zly394/EmojiRegex

Notes:

The answer provided by @Eric Nakagawa contains some errors and does not work properly.

Vensent Wang
  • This captures a lot more than emojis. If you use this on [Big List of Naughty Strings](https://github.com/minimaxir/big-list-of-naughty-strings) you'll get plenty of non-emoji matches. – Jack Cole May 15 '19 at 23:52
5

Assuming that you are asking for standard Unicode emoji ranges (there are different blocks by vendor) you may consider these three ranges:

  • 0x20a0 - 0x32ff
  • 0x1f000 - 0x1ffff
  • 0xfe4e5 - 0xfe4ee

In addition to the thoughtful explanation that T.J. Crowder has shared with us, it should be said that, beginning with Java 7, it is possible to match UTF-16 encoded surrogate pairs with ease.

Take a look at the docs:

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
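
As a rough illustration of that `\x{...}` construct, the sketch below matches by code point range rather than by surrogate halves. It assumes Java 7 or later. The class name is made up, and the two ranges (Miscellaneous Symbols and Pictographs, plus the Emoticons block U+1F600 to U+1F64F) are only examples, not a full emoji list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodePointRegex {

    // Ranges written as code points; illustrative only, not a complete emoji set.
    private static final Pattern SKETCH = Pattern.compile(
            "[\\x{1F300}-\\x{1F5FF}\\x{1F600}-\\x{1F64F}]");

    // Returns each matched code point as its own String (two chars per match here).
    static List<String> findAll(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = SKETCH.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = new StringBuilder("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
        for (String match : findAll(s)) {
            System.out.println(match + " U+" + Integer.toHexString(match.codePointAt(0)).toUpperCase());
        }
    }
}
```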

Nevertheless, if you cannot switch to Java 7, you can extend the valuable UnicodeEscaper provided by Guava.

Here is an implementation for the sake of example:

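import com.google.common.escape.UnicodeEscaper;

public class SimpleEscaper extends UnicodeEscaper
{
    @Override
    protected char[] escape(int codePoint)
    {
        // Replace code points in the 0x1f000 - 0x1ffff range with their hex value
        if (0x1f000 <= codePoint && codePoint <= 0x1ffff)
        {
            return Integer.toHexString(codePoint).toCharArray();
        }

        // Everything else passes through as its UTF-16 representation
        return Character.toChars(codePoint);
    }
}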
Mr.C
4

Emoji regex

public static final String sEmojiRegex = "(?:[\\u2700-\\u27bf]|" +

        "(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
        "[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?" +

        "(?:\\u200d(?:[^\\ud800-\\udfff]|" +

        "(?:[\\ud83c\\udde6-\\ud83c\\uddff]){2}|" +
        "[\\ud800\\udc00-\\uDBFF\\uDFFF]|[\\u2600-\\u26FF])[\\ufe0e\\ufe0f]?(?:[\\u0300-\\u036f\\ufe20-\\ufe23\\u20d0-\\u20f0]|[\\ud83c\\udffb-\\ud83c\\udfff])?)*|" +

        "[\\u0023-\\u0039]\\ufe0f?\\u20e3|\\u3299|\\u3297|\\u303d|\\u3030|\\u24c2|[\\ud83c\\udd70-\\ud83c\\udd71]|[\\ud83c\\udd7e-\\ud83c\\udd7f]|\\ud83c\\udd8e|[\\ud83c\\udd91-\\ud83c\\udd9a]|[\\ud83c\\udde6-\\ud83c\\uddff]|[\\ud83c\\ude01-\\ud83c\\ude02]|\\ud83c\\ude1a|\\ud83c\\ude2f|[\\ud83c\\ude32-\\ud83c\\ude3a]|[\\ud83c\\ude50-\\ud83c\\ude51]|\\u203c|\\u2049|[\\u25aa-\\u25ab]|\\u25b6|\\u25c0|[\\u25fb-\\u25fe]|\\u00a9|\\u00ae|\\u2122|\\u2139|\\ud83c\\udc04|[\\u2600-\\u26FF]|\\u2b05|\\u2b06|\\u2b07|\\u2b1b|\\u2b1c|\\u2b50|\\u2b55|\\u231a|\\u231b|\\u2328|\\u23cf|[\\u23e9-\\u23f3]|[\\u23f8-\\u23fa]|\\ud83c\\udccf|\\u2934|\\u2935|[\\u2190-\\u21ff]";

some emojis (1627)

// count = 1627
public static final String sEmojiTest = "☺️☹️☠️✊✌️☝️✋✍️‍♀‍♀‍♀‍♀‍♀️‍♀️‍⚕‍⚕‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍✈‍✈‍‍‍⚖‍⚖‍♀‍♂‍♂‍♂‍♂‍♀‍♂‍♀‍♂‍♂‍♂‍♂‍♂‍♂‍♀‍♀‍❤️‍‍❤️‍‍❤️‍‍‍❤️‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍⛑☂️☘️⭐️✨⚡️☄☀️⛅️☁️⛈☃️⛄️❄️☔️☕️⚽️⚾️⛳️⛸⛷️‍♀️‍♀‍♂‍♀‍♂⛹️‍♀️⛹‍♀‍♂️‍♀️‍♀‍♀‍♀‍♂‍♀‍♀‍♀‍♀‍♂✈️⛵️⛴⚓️⛽️⛲️⛱⛰⛺️⛪️⛩⌚️⌨️☎️⏱⏲⏰⌛️⏳⚖️⚒⛏⚙️⛓⚔️⚰️⚱️⚗️✉️✂️✒️✏️❤️❣️☮️✝️☪️☸️✡️☯️☦️⛎♈️♉️♊️♋️♌️♍️♎️♏️♐️♑️♒️♓️⚛️☢️☣️️️✴️㊙️㊗️️️️❌⭕️⛔️♨️❗️❕❓❔‼️⁉️〽️⚠️⚜️♻️✅️❇️✳️❎Ⓜ️♿️️️ℹ️0️⃣1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣9️⃣#️⃣*️⃣▶️⏸⏯⏹⏺⏭⏮⏩⏪⏫⏬◀️➡️⬅️⬆️⬇️↗️↘️↙️↖️↕️↔️↪️↩️⤴️⤵️➕➖➗✖️™️©️®️〰️➰➿✔️☑️⚪️⚫️▪️▫️◾️◽️◼️◻️⬛️⬜️‍♠️♣️♥️♦️️️️‍⚽️⚾️⛳️⛸⛷️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️️‍♀️‍♂️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♂️‍♂️‍♂️‍♂️‍♂️‍♂️⛹️‍♀️⛹‍♀️⛹‍♀️⛹‍♀️⛹‍♀️⛹‍♀️⛹️⛹⛹⛹⛹⛹‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♂️‍♂️‍♂️‍♂️‍♂️‍♂️️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♂️‍♂️‍♂️‍♂️‍♂️‍♂️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♀️‍♂️";

function to test emojis

public void checkMatchingEmojis() {

    final Pattern pattern = Pattern.compile(sEmojiRegex);
    final Matcher matcher = pattern.matcher(sEmojiTest);
    int foundEmojiCount = 0;
    while (matcher.find()) {
        System.out.println("Full match: " + matcher.group(0));
        foundEmojiCount++;
    }
    System.out.println("*******************************************");
    System.out.println("Input Emoji count = 1627");
    System.out.println("Captured Emoji count = " + foundEmojiCount);
    System.out.println("*******************************************");

}

Here is the gist, tested on all Unicode 10 emojis.

Thanks to Kevin Scott for writing a great example.

3

You may also use emoji4j library.

String emojiText = "A ,  and a  became friends. For 's birthday party, they all had s, s, s and .";

EmojiUtils.removeAllEmojis(emojiText);//returns "A ,  and a  became friends. For 's birthday party, they all had s, s, s and ."
Chaitanya
2

This is what I use to remove emojis; so far it has let through all other alphabets.

// Android-only: needs "import android.os.Build;" for the SDK version check
private static String remove_Emojis(String name)
{
    // every character we decide to keep is appended here
    StringBuilder nonEmoji = new StringBuilder();

    // loop through, checking each character to see whether it is an emoji or not
    for (int i = 0; i < name.length(); i++)
    {
        char c = name.charAt(i);

        if (Character.isLetterOrDigit(c))
        {
            nonEmoji.append(c);
        }
        // this is just a 2nd check in case the other method didn't allow some letter
        else if (Build.VERSION.SDK_INT > 18 && Character.isAlphabetic(c))
        {
            nonEmoji.append(c);
        }

        if (c == ' ') // may want to consider adding '-' or '\'' as well
        {
            nonEmoji.append(c); // just add it
        }

        if (c == '@' && !name.contains(" ")) // I put this in for email addresses
        {
            nonEmoji.append('@');
        }
    }

    // the builder now holds the name without emojis
    return nonEmoji.toString();
}
1

\p{Cs} works well for matching emoji with PCRE regex flavours. Test it at https://regex101.com/r/o69vJJ/1.

The Unicode Character Category is "Other, Surrogate".

ssent1
1

There is a lifesaver project that can help:

https://github.com/mathiasbynens/emoji-test-regex-pattern

emoji-test-regex-pattern offers Java- and JavaScript-compatible regular expression patterns to match all emoji symbols and sequences listed in the emoji-test.txt file provided as part of Unicode® Technical Standard #51.

Moreover, it provides C++- and CSS-compatible regular expression patterns. You'll find the generated patterns in the dist folder. All the regular expression patterns are generated from Unicode data, so all Unicode emojis are covered. For a JavaScript project, just use emoji-regex, which is powered by emoji-test-regex-pattern.

linxie
0

You can generate your own regex with this tool whenever the spec changes (screenshot here).

For UTF-8/32 mode (stringed), expanded mode:

"     # Use the 'Mega-Conversion' tool to change into other syntaxes"
"     # -------------------------------------------------------------"
"     "
"     [#*0-9] \\x{FE0F} \\x{20E3}"
"  |  [\\x{A9}\\x{AE}\\x{203C}\\x{2049}\\x{2122}\\x{2139}\\x{2194}-\\x{2199}\\x{21A9}\\x{21AA}\\x{231A}\\x{231B}\\x{2328}\\x{23CF}\\x{23E9}-\\x{23F3}\\x{23F8}-\\x{23FA}\\x{24C2}\\x{25AA}\\x{25AB}\\x{25B6}\\x{25C0}\\x{25FB}-\\x{25FE}\\x{2600}-\\x{2604}\\x{260E}\\x{2611}\\x{2614}\\x{2615}\\x{2618}]"
"  |  \\x{261D} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{2620}\\x{2622}\\x{2623}\\x{2626}\\x{262A}\\x{262E}\\x{262F}\\x{2638}-\\x{263A}\\x{2640}\\x{2642}\\x{2648}-\\x{2653}\\x{265F}\\x{2660}\\x{2663}\\x{2665}\\x{2666}\\x{2668}\\x{267B}\\x{267E}\\x{267F}\\x{2692}-\\x{2697}\\x{2699}\\x{269B}\\x{269C}\\x{26A0}\\x{26A1}\\x{26AA}\\x{26AB}\\x{26B0}\\x{26B1}\\x{26BD}\\x{26BE}\\x{26C4}\\x{26C5}\\x{26C8}\\x{26CE}\\x{26CF}\\x{26D1}\\x{26D3}\\x{26D4}\\x{26E9}\\x{26EA}\\x{26F0}-\\x{26F5}\\x{26F7}\\x{26F8}]"
"  |  \\x{26F9}"
"     (?:"
"          \\x{FE0F} \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{26FA}\\x{26FD}\\x{2702}\\x{2705}\\x{2708}\\x{2709}]"
"  |  [\\x{270A}-\\x{270D}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{270F}\\x{2712}\\x{2714}\\x{2716}\\x{271D}\\x{2721}\\x{2728}\\x{2733}\\x{2734}\\x{2744}\\x{2747}\\x{274C}\\x{274E}\\x{2753}-\\x{2755}\\x{2757}\\x{2763}\\x{2764}\\x{2795}-\\x{2797}\\x{27A1}\\x{27B0}\\x{27BF}\\x{2934}\\x{2935}\\x{2B05}-\\x{2B07}\\x{2B1B}\\x{2B1C}\\x{2B50}\\x{2B55}\\x{3030}\\x{303D}\\x{3297}\\x{3299}\\x{1F004}\\x{1F0CF}\\x{1F170}\\x{1F171}\\x{1F17E}\\x{1F17F}\\x{1F18E}\\x{1F191}-\\x{1F19A}]"
"  |  \\x{1F1E6} [\\x{1F1E8}-\\x{1F1EC}\\x{1F1EE}\\x{1F1F1}\\x{1F1F2}\\x{1F1F4}\\x{1F1F6}-\\x{1F1FA}\\x{1F1FC}\\x{1F1FD}\\x{1F1FF}]"
"  |  \\x{1F1E7} [\\x{1F1E6}\\x{1F1E7}\\x{1F1E9}-\\x{1F1EF}\\x{1F1F1}-\\x{1F1F4}\\x{1F1F6}-\\x{1F1F9}\\x{1F1FB}\\x{1F1FC}\\x{1F1FE}\\x{1F1FF}]"
"  |  \\x{1F1E8} [\\x{1F1E6}\\x{1F1E8}\\x{1F1E9}\\x{1F1EB}-\\x{1F1EE}\\x{1F1F0}-\\x{1F1F5}\\x{1F1F7}\\x{1F1FA}-\\x{1F1FF}]"
"  |  \\x{1F1E9} [\\x{1F1EA}\\x{1F1EC}\\x{1F1EF}\\x{1F1F0}\\x{1F1F2}\\x{1F1F4}\\x{1F1FF}]"
"  |  \\x{1F1EA} [\\x{1F1E6}\\x{1F1E8}\\x{1F1EA}\\x{1F1EC}\\x{1F1ED}\\x{1F1F7}-\\x{1F1FA}]"
"  |  \\x{1F1EB} [\\x{1F1EE}-\\x{1F1F0}\\x{1F1F2}\\x{1F1F4}\\x{1F1F7}]"
"  |  \\x{1F1EC} [\\x{1F1E6}\\x{1F1E7}\\x{1F1E9}-\\x{1F1EE}\\x{1F1F1}-\\x{1F1F3}\\x{1F1F5}-\\x{1F1FA}\\x{1F1FC}\\x{1F1FE}]"
"  |  \\x{1F1ED} [\\x{1F1F0}\\x{1F1F2}\\x{1F1F3}\\x{1F1F7}\\x{1F1F9}\\x{1F1FA}]"
"  |  \\x{1F1EE} [\\x{1F1E8}-\\x{1F1EA}\\x{1F1F1}-\\x{1F1F4}\\x{1F1F6}-\\x{1F1F9}]"
"  |  \\x{1F1EF} [\\x{1F1EA}\\x{1F1F2}\\x{1F1F4}\\x{1F1F5}]"
"  |  \\x{1F1F0} [\\x{1F1EA}\\x{1F1EC}-\\x{1F1EE}\\x{1F1F2}\\x{1F1F3}\\x{1F1F5}\\x{1F1F7}\\x{1F1FC}\\x{1F1FE}\\x{1F1FF}]"
"  |  \\x{1F1F1} [\\x{1F1E6}-\\x{1F1E8}\\x{1F1EE}\\x{1F1F0}\\x{1F1F7}-\\x{1F1FB}\\x{1F1FE}]"
"  |  \\x{1F1F2} [\\x{1F1E6}\\x{1F1E8}-\\x{1F1ED}\\x{1F1F0}-\\x{1F1FF}]"
"  |  \\x{1F1F3} [\\x{1F1E6}\\x{1F1E8}\\x{1F1EA}-\\x{1F1EC}\\x{1F1EE}\\x{1F1F1}\\x{1F1F4}\\x{1F1F5}\\x{1F1F7}\\x{1F1FA}\\x{1F1FF}]"
"  |  \\x{1F1F4} \\x{1F1F2}"
"  |  \\x{1F1F5} [\\x{1F1E6}\\x{1F1EA}-\\x{1F1ED}\\x{1F1F0}-\\x{1F1F3}\\x{1F1F7}-\\x{1F1F9}\\x{1F1FC}\\x{1F1FE}]"
"  |  \\x{1F1F6} \\x{1F1E6}"
"  |  \\x{1F1F7} [\\x{1F1EA}\\x{1F1F4}\\x{1F1F8}\\x{1F1FA}\\x{1F1FC}]"
"  |  \\x{1F1F8} [\\x{1F1E6}-\\x{1F1EA}\\x{1F1EC}-\\x{1F1F4}\\x{1F1F7}-\\x{1F1F9}\\x{1F1FB}\\x{1F1FD}-\\x{1F1FF}]"
"  |  \\x{1F1F9} [\\x{1F1E6}\\x{1F1E8}\\x{1F1E9}\\x{1F1EB}-\\x{1F1ED}\\x{1F1EF}-\\x{1F1F4}\\x{1F1F7}\\x{1F1F9}\\x{1F1FB}\\x{1F1FC}\\x{1F1FF}]"
"  |  \\x{1F1FA} [\\x{1F1E6}\\x{1F1EC}\\x{1F1F2}\\x{1F1F3}\\x{1F1F8}\\x{1F1FE}\\x{1F1FF}]"
"  |  \\x{1F1FB} [\\x{1F1E6}\\x{1F1E8}\\x{1F1EA}\\x{1F1EC}\\x{1F1EE}\\x{1F1F3}\\x{1F1FA}]"
"  |  \\x{1F1FC} [\\x{1F1EB}\\x{1F1F8}]"
"  |  \\x{1F1FD} \\x{1F1F0}"
"  |  \\x{1F1FE} [\\x{1F1EA}\\x{1F1F9}]"
"  |  \\x{1F1FF} [\\x{1F1E6}\\x{1F1F2}\\x{1F1FC}]"
"  |  [\\x{1F201}\\x{1F202}\\x{1F21A}\\x{1F22F}\\x{1F232}-\\x{1F23A}\\x{1F250}\\x{1F251}\\x{1F300}-\\x{1F321}\\x{1F324}-\\x{1F384}]"
"  |  \\x{1F385} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F386}-\\x{1F393}\\x{1F396}\\x{1F397}\\x{1F399}-\\x{1F39B}\\x{1F39E}-\\x{1F3C1}]"
"  |  \\x{1F3C2} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F3C3}\\x{1F3C4}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F3C5}\\x{1F3C6}]"
"  |  \\x{1F3C7} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F3C8}\\x{1F3C9}]"
"  |  \\x{1F3CA}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F3CB}\\x{1F3CC}]"
"     (?:"
"          \\x{FE0F} \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F3CD}-\\x{1F3F0}]"
"  |  \\x{1F3F3}"
"     (?: \\x{FE0F} \\x{200D} \\x{1F308} )?"
"  |  \\x{1F3F4}"
"     (?:"
"          \\x{200D} \\x{2620} \\x{FE0F}"
"       |  \\x{E0067} \\x{E0062}"
"          (?:"
"               \\x{E0065} \\x{E006E} \\x{E0067}"
"            |  \\x{E0073} \\x{E0063} \\x{E0074}"
"            |  \\x{E0077} \\x{E006C} \\x{E0073}"
"          )"
"          \\x{E007F}"
"     )?"
"  |  [\\x{1F3F5}\\x{1F3F7}-\\x{1F440}]"
"  |  \\x{1F441}"
"     (?: \\x{FE0F} \\x{200D} \\x{1F5E8} \\x{FE0F} )?"
"  |  [\\x{1F442}\\x{1F443}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F444}\\x{1F445}]"
"  |  [\\x{1F446}-\\x{1F450}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F451}-\\x{1F465}]"
"  |  [\\x{1F466}\\x{1F467}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F468}"
"     (?:"
"          \\x{200D}"
"          (?:"
"               [\\x{2695}\\x{2696}\\x{2708}] \\x{FE0F}"
"            |  \\x{2764} \\x{FE0F} \\x{200D}"
"               (?: \\x{1F48B} \\x{200D} )?"
"               \\x{1F468}"
"            |  [\\x{1F33E}\\x{1F373}\\x{1F393}\\x{1F3A4}\\x{1F3A8}\\x{1F3EB}\\x{1F3ED}]"
"            |  \\x{1F466}"
"               (?: \\x{200D} \\x{1F466} )?"
"            |  \\x{1F467}"
"               (?: \\x{200D} [\\x{1F466}\\x{1F467}] )?"
"            |  [\\x{1F468}\\x{1F469}] \\x{200D}"
"               (?:"
"                    \\x{1F466}"
"                    (?: \\x{200D} \\x{1F466} )?"
"                 |  \\x{1F467}"
"                    (?: \\x{200D} [\\x{1F466}\\x{1F467}] )?"
"               )"
"            |  [\\x{1F4BB}\\x{1F4BC}\\x{1F527}\\x{1F52C}\\x{1F680}\\x{1F692}\\x{1F9B0}-\\x{1F9B3}]"
"          )"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?:"
"               \\x{200D}"
"               (?:"
"                    [\\x{2695}\\x{2696}\\x{2708}] \\x{FE0F}"
"                 |  [\\x{1F33E}\\x{1F373}\\x{1F393}\\x{1F3A4}\\x{1F3A8}\\x{1F3EB}\\x{1F3ED}\\x{1F4BB}\\x{1F4BC}\\x{1F527}\\x{1F52C}\\x{1F680}\\x{1F692}\\x{1F9B0}-\\x{1F9B3}]"
"               )"
"          )?"
"     )?"
"  |  \\x{1F469}"
"     (?:"
"          \\x{200D}"
"          (?:"
"               [\\x{2695}\\x{2696}\\x{2708}] \\x{FE0F}"
"            |  \\x{2764} \\x{FE0F} \\x{200D}"
"               (?: \\x{1F48B} \\x{200D} )?"
"               [\\x{1F468}\\x{1F469}]"
"            |  [\\x{1F33E}\\x{1F373}\\x{1F393}\\x{1F3A4}\\x{1F3A8}\\x{1F3EB}\\x{1F3ED}]"
"            |  \\x{1F466}"
"               (?: \\x{200D} \\x{1F466} )?"
"            |  \\x{1F467}"
"               (?: \\x{200D} [\\x{1F466}\\x{1F467}] )?"
"            |  \\x{1F469} \\x{200D}"
"               (?:"
"                    \\x{1F466}"
"                    (?: \\x{200D} \\x{1F466} )?"
"                 |  \\x{1F467}"
"                    (?: \\x{200D} [\\x{1F466}\\x{1F467}] )?"
"               )"
"            |  [\\x{1F4BB}\\x{1F4BC}\\x{1F527}\\x{1F52C}\\x{1F680}\\x{1F692}\\x{1F9B0}-\\x{1F9B3}]"
"          )"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?:"
"               \\x{200D}"
"               (?:"
"                    [\\x{2695}\\x{2696}\\x{2708}] \\x{FE0F}"
"                 |  [\\x{1F33E}\\x{1F373}\\x{1F393}\\x{1F3A4}\\x{1F3A8}\\x{1F3EB}\\x{1F3ED}\\x{1F4BB}\\x{1F4BC}\\x{1F527}\\x{1F52C}\\x{1F680}\\x{1F692}\\x{1F9B0}-\\x{1F9B3}]"
"               )"
"          )?"
"     )?"
"  |  [\\x{1F46A}-\\x{1F46D}]"
"  |  \\x{1F46E}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F46F}"
"     (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"  |  \\x{1F470} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F471}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F472} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F473}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F474}-\\x{1F476}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F477}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F478} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F479}-\\x{1F47B}]"
"  |  \\x{1F47C} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F47D}-\\x{1F480}]"
"  |  [\\x{1F481}\\x{1F482}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F483} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F484}"
"  |  \\x{1F485} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F486}\\x{1F487}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F488}-\\x{1F4A9}]"
"  |  \\x{1F4AA} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F4AB}-\\x{1F4FD}\\x{1F4FF}-\\x{1F53D}\\x{1F549}-\\x{1F54E}\\x{1F550}-\\x{1F567}\\x{1F56F}\\x{1F570}\\x{1F573}]"
"  |  \\x{1F574} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F575}"
"     (?:"
"          \\x{FE0F} \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F576}-\\x{1F579}]"
"  |  \\x{1F57A} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F587}\\x{1F58A}-\\x{1F58D}]"
"  |  [\\x{1F590}\\x{1F595}\\x{1F596}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F5A4}\\x{1F5A5}\\x{1F5A8}\\x{1F5B1}\\x{1F5B2}\\x{1F5BC}\\x{1F5C2}-\\x{1F5C4}\\x{1F5D1}-\\x{1F5D3}\\x{1F5DC}-\\x{1F5DE}\\x{1F5E1}\\x{1F5E3}\\x{1F5E8}\\x{1F5EF}\\x{1F5F3}\\x{1F5FA}-\\x{1F644}]"
"  |  [\\x{1F645}-\\x{1F647}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F648}-\\x{1F64A}]"
"  |  \\x{1F64B}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F64C} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F64D}\\x{1F64E}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F64F} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F680}-\\x{1F6A2}]"
"  |  \\x{1F6A3}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F6A4}-\\x{1F6B3}]"
"  |  [\\x{1F6B4}-\\x{1F6B6}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F6B7}-\\x{1F6BF}]"
"  |  \\x{1F6C0} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F6C1}-\\x{1F6C5}\\x{1F6CB}]"
"  |  \\x{1F6CC} [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F6CD}-\\x{1F6D2}\\x{1F6E0}-\\x{1F6E5}\\x{1F6E9}\\x{1F6EB}\\x{1F6EC}\\x{1F6F0}\\x{1F6F3}-\\x{1F6F9}\\x{1F910}-\\x{1F917}]"
"  |  [\\x{1F918}-\\x{1F91C}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F91D}"
"  |  [\\x{1F91E}\\x{1F91F}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  [\\x{1F920}-\\x{1F925}]"
"  |  \\x{1F926}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F927}-\\x{1F92F}]"
"  |  [\\x{1F930}-\\x{1F936}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F937}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F938}\\x{1F939}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  \\x{1F93A}"
"  |  \\x{1F93C}"
"     (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"  |  [\\x{1F93D}\\x{1F93E}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F940}-\\x{1F945}\\x{1F947}-\\x{1F970}\\x{1F973}-\\x{1F976}\\x{1F97A}\\x{1F97C}-\\x{1F9A2}\\x{1F9B0}-\\x{1F9B4}]"
"  |  [\\x{1F9B5}\\x{1F9B6}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F9B7}"
"  |  [\\x{1F9B8}\\x{1F9B9}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F9C0}-\\x{1F9C2}\\x{1F9D0}]"
"  |  [\\x{1F9D1}-\\x{1F9D5}] [\\x{1F3FB}-\\x{1F3FF}]?"
"  |  \\x{1F9D6}"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F9D7}-\\x{1F9DD}]"
"     (?:"
"          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
"       |  [\\x{1F3FB}-\\x{1F3FF}]"
"          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"     )?"
"  |  [\\x{1F9DE}\\x{1F9DF}]"
"     (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
"  |  [\\x{1F9E0}-\\x{1F9FF}]"

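The expanded form above is laid out with whitespace between tokens, so it presumably has to be compiled in free-spacing mode. The following is a minimal sketch of that, assuming `Pattern.COMMENTS` is the intended flag; the class name, the `cps` helper and the seven-line excerpt are illustrative only and are not the full pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpandedPatternDemo {
    // Short excerpt written in the same layout as the expanded pattern; NOT the full pattern.
    static final String EXCERPT =
          "     \\x{1F46E}"
        + "     (?:"
        + "          \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F}"
        + "       |  [\\x{1F3FB}-\\x{1F3FF}]"
        + "          (?: \\x{200D} [\\x{2640}\\x{2642}] \\x{FE0F} )?"
        + "     )?"
        + "  |  [\\x{1F9E0}-\\x{1F9FF}]";

    // Assumption: COMMENTS mode is what lets the engine ignore the layout whitespace.
    static final Pattern EMOJI = Pattern.compile(EXCERPT, Pattern.COMMENTS);

    // Hypothetical helper: build a String from code points, avoiding literal emojis in source.
    static String cps(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Police officer + skin tone + ZWJ + female sign + variation selector, then a brain.
        String officer = cps(0x1F46E, 0x1F3FD, 0x200D, 0x2640, 0xFE0F);
        String brain = cps(0x1F9E0);
        for (String emoji : findAll("a " + officer + " b " + brain)) {
            System.out.println(emoji + " (" + emoji.codePointCount(0, emoji.length()) + " code points)");
        }
    }
}
```

The compound sequence is returned as one match because the optional groups after `\x{1F46E}` consume the modifier and the joiner sequence before the next alternative is tried.
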
For UTF-16 mode (as a string literal), compressed form:

"[#*0-9]\\uFE0F\\u20E3|[\\u00A9\\u00AE\\u203C\\u2049\\u2122\\u2139\\u2"
"194-\\u2199\\u21A9\\u21AA\\u231A\\u231B\\u2328\\u23CF\\u23E9-\\u23F3\\"
"u23F8-\\u23FA\\u24C2\\u25AA\\u25AB\\u25B6\\u25C0\\u25FB-\\u25FE\\u260"
"0-\\u2604\\u260E\\u2611\\u2614\\u2615\\u2618]|\\u261D(?:\\uD83C[\\uDF"
"FB-\\uDFFF])?|[\\u2620\\u2622\\u2623\\u2626\\u262A\\u262E\\u262F\\u26"
"38-\\u263A\\u2640\\u2642\\u2648-\\u2653\\u265F\\u2660\\u2663\\u2665\\u"
"2666\\u2668\\u267B\\u267E\\u267F\\u2692-\\u2697\\u2699\\u269B\\u269C\\"
"u26A0\\u26A1\\u26AA\\u26AB\\u26B0\\u26B1\\u26BD\\u26BE\\u26C4\\u26C5\\"
"u26C8\\u26CE\\u26CF\\u26D1\\u26D3\\u26D4\\u26E9\\u26EA\\u26F0-\\u26F5"
"\\u26F7\\u26F8]|\\u26F9(?:\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640"
"\\u2642]\\uFE0F)?|\\uFE0F\\u200D[\\u2640\\u2642]\\uFE0F)?|[\\u26FA\\u"
"26FD\\u2702\\u2705\\u2708\\u2709]|[\\u270A-\\u270D](?:\\uD83C[\\uDFF"
"B-\\uDFFF])?|[\\u270F\\u2712\\u2714\\u2716\\u271D\\u2721\\u2728\\u273"
"3\\u2734\\u2744\\u2747\\u274C\\u274E\\u2753-\\u2755\\u2757\\u2763\\u27"
"64\\u2795-\\u2797\\u27A1\\u27B0\\u27BF\\u2934\\u2935\\u2B05-\\u2B07\\u"
"2B1B\\u2B1C\\u2B50\\u2B55\\u3030\\u303D\\u3297\\u3299]|\\uD83C(?:[\\u"
"DC04\\uDCCF\\uDD70\\uDD71\\uDD7E\\uDD7F\\uDD8E\\uDD91-\\uDD9A]|\\uDDE"
"6\\uD83C[\\uDDE8-\\uDDEC\\uDDEE\\uDDF1\\uDDF2\\uDDF4\\uDDF6-\\uDDFA\\u"
"DDFC\\uDDFD\\uDDFF]|\\uDDE7\\uD83C[\\uDDE6\\uDDE7\\uDDE9-\\uDDEF\\uDD"
"F1-\\uDDF4\\uDDF6-\\uDDF9\\uDDFB\\uDDFC\\uDDFE\\uDDFF]|\\uDDE8\\uD83C"
"[\\uDDE6\\uDDE8\\uDDE9\\uDDEB-\\uDDEE\\uDDF0-\\uDDF5\\uDDF7\\uDDFA-\\u"
"DDFF]|\\uDDE9\\uD83C[\\uDDEA\\uDDEC\\uDDEF\\uDDF0\\uDDF2\\uDDF4\\uDDF"
"F]|\\uDDEA\\uD83C[\\uDDE6\\uDDE8\\uDDEA\\uDDEC\\uDDED\\uDDF7-\\uDDFA]"
"|\\uDDEB\\uD83C[\\uDDEE-\\uDDF0\\uDDF2\\uDDF4\\uDDF7]|\\uDDEC\\uD83C["
"\\uDDE6\\uDDE7\\uDDE9-\\uDDEE\\uDDF1-\\uDDF3\\uDDF5-\\uDDFA\\uDDFC\\uD"
"DFE]|\\uDDED\\uD83C[\\uDDF0\\uDDF2\\uDDF3\\uDDF7\\uDDF9\\uDDFA]|\\uDD"
"EE\\uD83C[\\uDDE8-\\uDDEA\\uDDF1-\\uDDF4\\uDDF6-\\uDDF9]|\\uDDEF\\uD8"
"3C[\\uDDEA\\uDDF2\\uDDF4\\uDDF5]|\\uDDF0\\uD83C[\\uDDEA\\uDDEC-\\uDDE"
"E\\uDDF2\\uDDF3\\uDDF5\\uDDF7\\uDDFC\\uDDFE\\uDDFF]|\\uDDF1\\uD83C[\\u"
"DDE6-\\uDDE8\\uDDEE\\uDDF0\\uDDF7-\\uDDFB\\uDDFE]|\\uDDF2\\uD83C[\\uD"
"DE6\\uDDE8-\\uDDED\\uDDF0-\\uDDFF]|\\uDDF3\\uD83C[\\uDDE6\\uDDE8\\uDD"
"EA-\\uDDEC\\uDDEE\\uDDF1\\uDDF4\\uDDF5\\uDDF7\\uDDFA\\uDDFF]|\\uDDF4\\"
"uD83C\\uDDF2|\\uDDF5\\uD83C[\\uDDE6\\uDDEA-\\uDDED\\uDDF0-\\uDDF3\\uD"
"DF7-\\uDDF9\\uDDFC\\uDDFE]|\\uDDF6\\uD83C\\uDDE6|\\uDDF7\\uD83C[\\uDD"
"EA\\uDDF4\\uDDF8\\uDDFA\\uDDFC]|\\uDDF8\\uD83C[\\uDDE6-\\uDDEA\\uDDEC"
"-\\uDDF4\\uDDF7-\\uDDF9\\uDDFB\\uDDFD-\\uDDFF]|\\uDDF9\\uD83C[\\uDDE6"
"\\uDDE8\\uDDE9\\uDDEB-\\uDDED\\uDDEF-\\uDDF4\\uDDF7\\uDDF9\\uDDFB\\uDD"
"FC\\uDDFF]|\\uDDFA\\uD83C[\\uDDE6\\uDDEC\\uDDF2\\uDDF3\\uDDF8\\uDDFE\\"
"uDDFF]|\\uDDFB\\uD83C[\\uDDE6\\uDDE8\\uDDEA\\uDDEC\\uDDEE\\uDDF3\\uDD"
"FA]|\\uDDFC\\uD83C[\\uDDEB\\uDDF8]|\\uDDFD\\uD83C\\uDDF0|\\uDDFE\\uD8"
"3C[\\uDDEA\\uDDF9]|\\uDDFF\\uD83C[\\uDDE6\\uDDF2\\uDDFC]|[\\uDE01\\uD"
"E02\\uDE1A\\uDE2F\\uDE32-\\uDE3A\\uDE50\\uDE51\\uDF00-\\uDF21\\uDF24-"
"\\uDF84]|\\uDF85(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDF86-\\uDF93\\uDF9"
"6\\uDF97\\uDF99-\\uDF9B\\uDF9E-\\uDFC1]|\\uDFC2(?:\\uD83C[\\uDFFB-\\u"
"DFFF])?|[\\uDFC3\\uDFC4](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\"
"uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDFC5\\uDFC6"
"]|\\uDFC7(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDFC8\\uDFC9]|\\uDFCA(?:\\"
"u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2"
"640\\u2642]\\uFE0F)?)?|[\\uDFCB\\uDFCC](?:\\uD83C[\\uDFFB-\\uDFFF]("
"?:\\u200D[\\u2640\\u2642]\\uFE0F)?|\\uFE0F\\u200D[\\u2640\\u2642]\\uF"
"E0F)?|[\\uDFCD-\\uDFF0]|\\uDFF3(?:\\uFE0F\\u200D\\uD83C\\uDF08)?|\\u"
"DFF4(?:\\u200D\\u2620\\uFE0F|\\uDB40\\uDC67\\uDB40\\uDC62\\uDB40(?:\\"
"uDC65\\uDB40\\uDC6E\\uDB40\\uDC67|\\uDC73\\uDB40\\uDC63\\uDB40\\uDC74"
"|\\uDC77\\uDB40\\uDC6C\\uDB40\\uDC73)\\uDB40\\uDC7F)?|[\\uDFF5\\uDFF7"
"-\\uDFFF])|\\uD83D(?:[\\uDC00-\\uDC40]|\\uDC41(?:\\uFE0F\\u200D\\uD8"
"3D\\uDDE8\\uFE0F)?|[\\uDC42\\uDC43](?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\"
"uDC44\\uDC45]|[\\uDC46-\\uDC50](?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDC"
"51-\\uDC65]|[\\uDC66\\uDC67](?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDC68(?"
":\\u200D(?:[\\u2695\\u2696\\u2708]\\uFE0F|\\u2764\\uFE0F\\u200D\\uD83"
"D(?:\\uDC8B\\u200D\\uD83D)?\\uDC68|\\uD83C[\\uDF3E\\uDF73\\uDF93\\uDF"
"A4\\uDFA8\\uDFEB\\uDFED]|\\uD83D(?:\\uDC66(?:\\u200D\\uD83D\\uDC66)?"
"|\\uDC67(?:\\u200D\\uD83D[\\uDC66\\uDC67])?|[\\uDC68\\uDC69]\\u200D\\"
"uD83D(?:\\uDC66(?:\\u200D\\uD83D\\uDC66)?|\\uDC67(?:\\u200D\\uD83D["
"\\uDC66\\uDC67])?)|[\\uDCBB\\uDCBC\\uDD27\\uDD2C\\uDE80\\uDE92])|\\uD"
"83E[\\uDDB0-\\uDDB3])|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D(?:[\\u2695"
"\\u2696\\u2708]\\uFE0F|\\uD83C[\\uDF3E\\uDF73\\uDF93\\uDFA4\\uDFA8\\uD"
"FEB\\uDFED]|\\uD83D[\\uDCBB\\uDCBC\\uDD27\\uDD2C\\uDE80\\uDE92]|\\uD8"
"3E[\\uDDB0-\\uDDB3]))?)?|\\uDC69(?:\\u200D(?:[\\u2695\\u2696\\u2708"
"]\\uFE0F|\\u2764\\uFE0F\\u200D\\uD83D(?:\\uDC8B\\u200D\\uD83D)?[\\uDC"
"68\\uDC69]|\\uD83C[\\uDF3E\\uDF73\\uDF93\\uDFA4\\uDFA8\\uDFEB\\uDFED]"
"|\\uD83D(?:\\uDC66(?:\\u200D\\uD83D\\uDC66)?|\\uDC67(?:\\u200D\\uD83"
"D[\\uDC66\\uDC67])?|\\uDC69\\u200D\\uD83D(?:\\uDC66(?:\\u200D\\uD83D"
"\\uDC66)?|\\uDC67(?:\\u200D\\uD83D[\\uDC66\\uDC67])?)|[\\uDCBB\\uDCB"
"C\\uDD27\\uDD2C\\uDE80\\uDE92])|\\uD83E[\\uDDB0-\\uDDB3])|\\uD83C[\\u"
"DFFB-\\uDFFF](?:\\u200D(?:[\\u2695\\u2696\\u2708]\\uFE0F|\\uD83C[\\u"
"DF3E\\uDF73\\uDF93\\uDFA4\\uDFA8\\uDFEB\\uDFED]|\\uD83D[\\uDCBB\\uDCB"
"C\\uDD27\\uDD2C\\uDE80\\uDE92]|\\uD83E[\\uDDB0-\\uDDB3]))?)?|[\\uDC6"
"A-\\uDC6D]|\\uDC6E(?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-"
"\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|\\uDC6F(?:\\u200D[\\u2"
"640\\u2642]\\uFE0F)?|\\uDC70(?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDC71(?"
":\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\"
"u2640\\u2642]\\uFE0F)?)?|\\uDC72(?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDC"
"73(?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u20"
"0D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDC74-\\uDC76](?:\\uD83C[\\uDFFB-\\"
"uDFFF])?|\\uDC77(?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\"
"uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|\\uDC78(?:\\uD83C[\\uDF"
"FB-\\uDFFF])?|[\\uDC79-\\uDC7B]|\\uDC7C(?:\\uD83C[\\uDFFB-\\uDFFF])"
"?|[\\uDC7D-\\uDC80]|[\\uDC81\\uDC82](?:\\u200D[\\u2640\\u2642]\\uFE0"
"F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|\\uD"
"C83(?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDC84|\\uDC85(?:\\uD83C[\\uDFFB-"
"\\uDFFF])?|[\\uDC86\\uDC87](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C"
"[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDC88-\\uD"
"CA9]|\\uDCAA(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDCAB-\\uDCFD\\uDCFF-\\"
"uDD3D\\uDD49-\\uDD4E\\uDD50-\\uDD67\\uDD6F\\uDD70\\uDD73]|\\uDD74(?:"
"\\uD83C[\\uDFFB-\\uDFFF])?|\\uDD75(?:\\uD83C[\\uDFFB-\\uDFFF](?:\\u2"
"00D[\\u2640\\u2642]\\uFE0F)?|\\uFE0F\\u200D[\\u2640\\u2642]\\uFE0F)?"
"|[\\uDD76-\\uDD79]|\\uDD7A(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDD87\\uD"
"D8A-\\uDD8D]|[\\uDD90\\uDD95\\uDD96](?:\\uD83C[\\uDFFB-\\uDFFF])?|["
"\\uDDA4\\uDDA5\\uDDA8\\uDDB1\\uDDB2\\uDDBC\\uDDC2-\\uDDC4\\uDDD1-\\uDD"
"D3\\uDDDC-\\uDDDE\\uDDE1\\uDDE3\\uDDE8\\uDDEF\\uDDF3\\uDDFA-\\uDE44]|"
"[\\uDE45-\\uDE47](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\"
"uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDE48-\\uDE4A]|\\uDE"
"4B(?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u20"
"0D[\\u2640\\u2642]\\uFE0F)?)?|\\uDE4C(?:\\uD83C[\\uDFFB-\\uDFFF])?|"
"[\\uDE4D\\uDE4E](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\u"
"DFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|\\uDE4F(?:\\uD83C[\\uDFF"
"B-\\uDFFF])?|[\\uDE80-\\uDEA2]|\\uDEA3(?:\\u200D[\\u2640\\u2642]\\uF"
"E0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|["
"\\uDEA4-\\uDEB3]|[\\uDEB4-\\uDEB6](?:\\u200D[\\u2640\\u2642]\\uFE0F|"
"\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDE"
"B7-\\uDEBF]|\\uDEC0(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDEC1-\\uDEC5\\u"
"DECB]|\\uDECC(?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDECD-\\uDED2\\uDEE0-"
"\\uDEE5\\uDEE9\\uDEEB\\uDEEC\\uDEF0\\uDEF3-\\uDEF9])|\\uD83E(?:[\\uDD"
"10-\\uDD17]|[\\uDD18-\\uDD1C](?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDD1D|"
"[\\uDD1E\\uDD1F](?:\\uD83C[\\uDFFB-\\uDFFF])?|[\\uDD20-\\uDD25]|\\uD"
"D26(?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u2"
"00D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDD27-\\uDD2F]|[\\uDD30-\\uDD36]("
"?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDD37(?:\\u200D[\\u2640\\u2642]\\uFE0"
"F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\u"
"DD38\\uDD39](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFF"
"F](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|\\uDD3A|\\uDD3C(?:\\u200D[\\"
"u2640\\u2642]\\uFE0F)?|[\\uDD3D\\uDD3E](?:\\u200D[\\u2640\\u2642]\\u"
"FE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|"
"[\\uDD40-\\uDD45\\uDD47-\\uDD70\\uDD73-\\uDD76\\uDD7A\\uDD7C-\\uDDA2\\"
"uDDB0-\\uDDB4]|[\\uDDB5\\uDDB6](?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDDB"
"7|[\\uDDB8\\uDDB9](?:\\u200D[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-"
"\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uDDC0-\\uDDC2\\uDDD"
"0]|[\\uDDD1-\\uDDD5](?:\\uD83C[\\uDFFB-\\uDFFF])?|\\uDDD6(?:\\u200D"
"[\\u2640\\u2642]\\uFE0F|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u"
"2642]\\uFE0F)?)?|[\\uDDD7-\\uDDDD](?:\\u200D[\\u2640\\u2642]\\uFE0F"
"|\\uD83C[\\uDFFB-\\uDFFF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?)?|[\\uD"
"DDE\\uDDDF](?:\\u200D[\\u2640\\u2642]\\uFE0F)?|[\\uDDE0-\\uDDFF])"
0

Regex is too slow for this, and the emoji list is updated very often.

Try this project simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Usage is simple:

EmojiUtils.containsEmoji(str)
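
For comparison, a regex-free scan can also be written with the standard library alone. The sketch below is not how simple-emoji-4j works; it is a plain code-point filter, and the two ranges it checks (U+1F300 to U+1F5FF from the question, plus U+1F600 to U+1F64F for the emoticons in the sample sentence) are an assumption chosen for illustration. A real emoji list covers many more blocks.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointScanDemo {
    // Assumed ranges for illustration only; a real emoji list spans many more blocks.
    static boolean inAssumedEmojiRange(int cp) {
        return (cp >= 0x1F300 && cp <= 0x1F5FF)   // Miscellaneous Symbols and Pictographs
            || (cp >= 0x1F600 && cp <= 0x1F64F);  // Emoticons
    }

    // Walk the string by code point (not by char) so surrogate pairs stay intact.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints()
            .filter(CodePointScanDemo::inAssumedEmojiRange)
            .forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    public static void main(String[] args) {
        String laugh = new String(Character.toChars(0x1F606));
        String tongue = new String(Character.toChars(0x1F61B));
        String text = "Thats a nice joke " + laugh + laugh + laugh + " " + tongue;
        System.out.println("Found " + extract(text).size() + " emojis.");
    }
}
```

This treats every code point separately, so joined sequences and skin-tone modifiers are not grouped the way the regex answers group them.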
coder4
0

Some PCRE implementations do not recognize \p. Many do not permit character ranges over characters exceeding 2 bytes, e.g. \udde6-\ud83c.

One effective trick I have come up with is to encode the text in a format that forces these characters to be escaped, such as JSON.

After encoding to JSON (with an encoder that escapes non-ASCII characters), each such character is a literal escape like \ud000, which can be found with a standard regular expression: \\\\ud[0-9a-f]{3} or \\\\u[0-9a-f]{4,6}

After filtering the escaped strings, the data can be decoded again without the emoticons present.
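
The encode, filter, decode round trip can be sketched in Java as follows. This is only an illustration under stated assumptions: the hand-written `escape` method stands in for a JSON encoder that escapes non-ASCII characters, all names are hypothetical, and the filter is narrowed to the surrogate range (d800 to dfff) so that other escaped text survives. It removes every character outside the Basic Multilingual Plane, not only emojis, and it leaves BMP emojis in place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeFilterDemo {
    // Matches escaped surrogate code units only (d800..dfff), lowercase hex.
    static final Pattern SURROGATE_ESCAPE = Pattern.compile("\\\\ud[89a-f][0-9a-f]{2}");
    static final Pattern ANY_ESCAPE = Pattern.compile("\\\\u([0-9a-f]{4})");

    // Stand-in for a JSON encoder: backslashes and everything outside
    // printable ASCII become a backslash, 'u' and four lowercase hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Turn every remaining escape back into the character it stands for.
    static String unescape(String s) {
        Matcher m = ANY_ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Encode, drop the escaped surrogates, decode.
    static String stripAstral(String s) {
        String filtered = SURROGATE_ESCAPE.matcher(escape(s)).replaceAll("");
        return unescape(filtered);
    }

    public static void main(String[] args) {
        String laugh = new String(Character.toChars(0x1F606));
        System.out.println(stripAstral("Thats a nice joke " + laugh + " right"));
    }
}
```

Escaping the backslash itself keeps text that already contains something that looks like an escape from being altered by the filter.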

ppostma1
0

Here's a simpler approach and a regex that correctly parses all 3,521 emojis (as of May 2021).

It's a programmatically built, simple alternation that works by matching the longest emojis first, thus avoiding a problem that arises with many suggested patterns: partial matching within compound emojis. (Example: ‍❤️‍‍ - since this is several emojis glued together with the Zero Width Joiner (U+200D), you need to match the longer sequence without partial matching on the components.)

So that the pattern is short enough to be pasted right here, we've brazenly used literal emojis, but unicode escapes work just as well (see links at bottom for demos and source code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MyClass {
    public static void main(String args[]) {
      String line = "Adds  word-relevant  emojis  ❤ to  text ✨ with  sometimes  hilarious   results . Read more  about‍❤️‍‍ matching compound emojis";
      String pattern = "(?:‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍||||‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍|‍❤️‍‍|‍❤️‍‍|‍❤️‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍‍|‍‍|‍❤️‍|‍❤️‍|‍❤️‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|‍‍|️‍️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀
️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍⚕️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍⚖️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍✈️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂
️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍♂️|‍♂️|‍♂️|‍♂️|‍♂️|‍♀️|‍♀️|‍♀️|‍♀️|‍♀️|‍️|️‍♂️|️‍♀️|️‍♂️|️‍♀️|️‍♂️|️‍♀️|️‍|️‍⚧️|⛹‍♂️|⛹‍♂️|⛹‍♂️|⛹‍♂️|⛹‍♂️|⛹‍♀️|⛹‍♀️|⛹‍♀️|⛹‍♀️|⛹‍♀️|‍|‍|❤️‍|❤️‍|‍♂️|‍♀️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♀️|‍♂️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍⚕️|‍⚕️|‍⚕️|‍|‍|‍|‍|‍|‍|‍⚖️|‍⚖️|‍⚖️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍✈️|‍✈️|‍✈️|‍|‍|‍|‍|‍|‍|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍|‍|‍|‍|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍|‍|‍|‍|‍|‍|‍|‍|‍|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|⛹️‍♂️|⛹️‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍♂️|‍♀️|‍|‍|‍|‍|‍|‍❄️|‍☠️|‍⬛|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||#️⃣|0️⃣|1️⃣|2️⃣|3️⃣|4️⃣|5️⃣|6️⃣|7️⃣|8️⃣|9️⃣|✋|✋|✋|✋|✋|✌|✌|✌|✌|✌|☝|☝|☝|☝|☝|✊|✊|✊|✊|✊|✍|✍|✍|✍|✍|⛹|⛹|⛹|⛹|⛹|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||☺|☹|☠|❣|❤|✋|✌|☝|✊|✍|⛷|⛹|☘|☕|⛰|⛪|⛩|⛲|⛺|♨|⛽|⚓|⛵|⛴|✈|⌛|⏳|⌚|⏰|⏱|⏲|☀|⭐|☁|⛅|⛈|☂|☔|⛱|⚡|❄|☃|⛄|☄|✨|⚽|⚾|⛳|⛸|♠|♥|♦|♣|♟|⛑|☎|⌨|✉|✏|✒|✂|⛏|⚒|⚔|⚙|⚖|⛓|⚗|⚰|⚱|♿|⚠|⛔|☢|☣|⬆|↗|➡|↘|⬇|↙|⬅|↖|↕|↔|↩|↪|⤴|⤵|⚛|✡|☸|☯|✝|☦|☪|☮|♈|♉|♊|♋|♌|♍|♎|♏|♐|♑|♒|♓|⛎|▶|⏩|⏭|⏯|◀|⏪|⏮|⏫|⏬|⏸|⏹|⏺|⏏|♀|♂|⚧|✖|➕|➖|➗|♾|‼|⁉|❓|❔|❕|❗|〰|⚕|♻|⚜|⭕|✅|☑|✔|❌|❎|➰|➿|〽|✳|✴|❇|©|®|™|ℹ|Ⓜ|㊗|㊙|⚫|⚪|⬛|⬜|◼|◻|◾|◽|▪|▫)";
      int i = 0; // number of matches found
      Pattern r = Pattern.compile(pattern);
      Matcher m = r.matcher(line);
      while(m.find( )) {
         i++;
         System.out.println("Found value: " + m.group(0) );
      }
      System.out.println("Found " + i + " emojis." );
    }
}
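
The longest-first idea can be sketched on a tiny, hypothetical emoji list. This is not the generator from the linked repository, just an illustration of why the ordering matters: Java tries alternatives left to right, so the longer joined sequence has to come before its components. The five-entry list and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LongestFirstDemo {
    // Hypothetical helper: build a String from code points.
    static String cps(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Sort by length descending, quote each entry, and join into one alternation.
    static Pattern build(List<String> emojis) {
        String alternation = emojis.stream()
            .sorted((a, b) -> Integer.compare(b.length(), a.length()))
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String woman = cps(0x1F469);
        String man = cps(0x1F468);
        String heart = cps(0x2764, 0xFE0F);
        String kissMark = cps(0x1F48B);
        // Woman, heart, kiss mark and man joined by zero width joiners.
        String kiss = cps(0x1F469, 0x200D, 0x2764, 0xFE0F, 0x200D, 0x1F48B, 0x200D, 0x1F468);

        // Deliberately listed shortest first; build() reorders them.
        List<String> emojis = Arrays.asList(woman, man, heart, kissMark, kiss);
        List<String> found = findAll(build(emojis), "x" + kiss + "y");
        System.out.println("Found " + found.size() + " emojis.");
    }
}
```

Without the sort, the same input would be reported as four separate matches (woman, heart, kiss mark, man) with the joiners skipped.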

More information:

https://github.com/sweaver2112/Regex-combined-emojis

Regex 101 Demo (compact, unsafe literal emoji version)

Regex 101 Demo (long, safe unicode escape version)

Scott Weaver