28

I need to remove emoji from a string using a regex, while keeping CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine, except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?

kilua
  • 711
  • 1
  • 9
  • 16

11 Answers

37

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
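
The misparse can be checked directly. This is a minimal sketch, assuming Ruby 2.4 or later for `match?`; the variable names `literal` and `misparsed` are only illustrative:

```ruby
# "\u" consumes exactly four hex digits, so the fifth digit is a separate character.
literal = "\u1F600"
puts literal.length                                                 # 2
puts literal.codepoints.map { |c| format("U+%04X", c) }.join(" ")   # U+1F60 U+0030

# The class therefore covers U+1F60, the range "0"..U+1F6F, and "F".
misparsed = /[\u1F600-\u1F6FF]/
puts misparsed.match?("A")          # true:  "A" (U+0041) lies inside "0"..U+1F6F
puts misparsed.match?("中")         # false: U+4E2D lies outside the range
puts misparsed.match?("\u{1F600}")  # false: the emoji itself is not covered
```

Negating this class, as in the question, flips those results, which is why the Chinese characters were matched.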

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/

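This matches (emoji) characters in these Unicode blocks:

  • Emoticons (U+1F600 to U+1F64F)
  • Ornamental Dingbats (U+1F650 to U+1F67F)
  • Transport and Map Symbols (U+1F680 to U+1F6FF)
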
You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7 which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!" 

Regarding your Rubular example: emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).

Stefan
  • 109,145
  • 14
  • 143
  • 218
  • 1
    I have tried the regex that you provided. and this is the [link](http://rubular.com/r/C7HpUmiJjI) and this is using my regex that I mentioned in the question [link](http://rubular.com/r/cqO6RtTHdZ). It does not work with your regex, mine is working but has problem like I mentioned in the question. – kilua Jul 11 '14 at 03:19
  • @kilua those are not [emoji](http://en.wikipedia.org/wiki/Emoji) but [kaomoji](http://en.wikipedia.org/wiki/Kaomoji) – Stefan Jul 11 '14 at 05:14
  • Oh I see, let's just focus on emoji then. I am building this to prevent users from submitting emoji from iOS/Android to our server. I know they can disable it on the phone keyboard, but I still need to filter it out on the server side. Your regex works fine for emoji, but it does not work for other emoticons like "house", "town", "animals", etc. I have removed the kaomoji. – kilua Jul 11 '14 at 06:56
  • You have to add the appropriate character ranges, e.g. to include [Miscellaneous Symbols And Pictographs](http://en.wikipedia.org/wiki/Miscellaneous_Symbols_and_Pictographs) start with U+1F300: `/[\u{1F300}-\u{1F6FF}]/`. Karol S already mentioned that. – Stefan Jul 11 '14 at 07:09
  • Yes, correct. I am using that one right now, but it seems it doesn't cover all the emoji or those miscellaneous symbols. I am on Mac OS X, where ctrl + cmd + space brings up the emoji picker, and not all of them are covered by `/[\u{1F300}-\u{1F6FF}]/`. Any help? Thank you very much – kilua Jul 11 '14 at 07:40
  • @kilua I've posted a [follow-up question](http://stackoverflow.com/q/24695159/477037) – Stefan Jul 11 '14 at 10:03
  • 1
    Are you sure about `"^_^".length #=> 1`? – sawa Jul 11 '14 at 10:45
  • @Stefan this is a great start but doesn't match all of "" ... I found this solution in Ruby which seems to work well: http://stackoverflow.com/questions/16487697/how-to-remove-4-byte-utf-8-characters-in-ruby – steve Apr 14 '16 at 20:51
  • Stefan your edition didn't work for ruby 1.8.7, the between is not taking the number inside the hex range, so it returns false, and then the emoji is not rejected – G. I. Joe Jun 05 '18 at 15:53
  • @G.I.Joe it does work, `0x1F600` is just another way of writing `128512`. – Stefan Jun 05 '18 at 16:14
  • no it doesn't, I ran it even your range is repeating the limits – G. I. Joe Jun 05 '18 at 20:32
  • @G.I.Joe sorry, there was a typo, the upper limit of course has to be `0x1F6FF` – Stefan Jun 05 '18 at 21:09
24

I am using one based on this script.

def strip_emoji(text)
  # Work on a UTF-8 copy so the caller's string is not modified
  clean = text.dup.force_encoding('utf-8').encode

  # symbols & pics
  clean = clean.gsub(/[\u{1f300}-\u{1f5ff}]/, "")

  # enclosed chars (I changed this range to exclude Chinese characters)
  clean = clean.gsub(/[\u{2500}-\u{2BEF}]/, "")

  # emoticons
  clean = clean.gsub(/[\u{1f600}-\u{1f64f}]/, "")

  # dingbats (the last gsub is also the method's return value)
  clean.gsub(/[\u{2702}-\u{27b0}]/, "")
end

Results:

irb> strip_emoji("☂❤华み원❤")
=> "华み원"
jellene
  • 409
  • 3
  • 9
  • great answer.. saves my day.. !! :) – Vishal Nov 10 '16 at 11:31
  • This worked well for me. I created a EmojiStripper concern that uses a before_validation callback to strip emojis from all string fields before validation is executed. That results in all emojis being stripped before it is saved to the DB. – curtp Dec 27 '16 at 21:29
  • 3
    WARNING: THE CODE IN THIS ANSWER WILL NOT REMOVE ALL EMOJIS. It removes simple emojis fine, but it does not fully remove multi code point emojis, such as zero-width-joiner (ZWJ) sequences or ☸️.
20

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:

[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]

I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.

Example usage:

regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji  and other Unicode characters 比如中文."
str.gsub regex, ''
# "I am a string with emoji  and other Unicode characters 比如中文."

Other Unicode characters, such as Asian characters, are preserved.

EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments for details.

franklsf95
  • 1,182
  • 12
  • 23
  • 1
    Huh, I pasted this into rubular and found that it matched numbers, too – rmosolgo Mar 31 '15 at 16:12
  • @rmosolgo Thanks for catching this! I have excluded numbers and other ASCII characters from the emoji. The reason numbers were included is that some emoji are of the form 8⃣ (`U+0038 U+20E3`). I manually removed those ASCII codes. – franklsf95 Mar 31 '15 at 20:51
17

Most of the answers in this thread don't remove all emojis correctly. They remove simple, single code point emojis fine, but they won't fully remove multi code point emojis such as zero-width-joiner (ZWJ) sequences or ☸️, leaving some residual Unicode code points behind.

You could use a gem like unicode-emoji to get the latest emoji regexes, but if you find this overkill the following code might be a good enough solution:

text.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip

This removes every character that is not alphanumeric, blank, or punctuation, which includes emoji and other unusual Unicode symbols.
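
To see the kind of residue a plain block range leaves behind, here is a minimal sketch. The family sequence and the sample sentence are made-up inputs, and `family`, `leftover`, and `cleaned` are illustrative names:

```ruby
# A family emoji is four pictographs glued together with ZERO WIDTH JOINER (U+200D).
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

# A plain block range strips the pictographs but leaves the three joiners behind.
leftover = family.gsub(/[\u{1F300}-\u{1F6FF}]/, '')
p leftover.codepoints.map { |c| format("U+%04X", c) }  # ["U+200D", "U+200D", "U+200D"]

# The POSIX-bracket approach drops the joiners too and keeps the CJK text.
text = "Hi #{family} 中文!"
cleaned = text.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip
p cleaned  # "Hi 中文!"
```

The CJK characters survive because Ruby's POSIX bracket classes are Unicode-aware on UTF-8 strings, so `[:alnum:]` is not limited to ASCII letters.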

Jerome Dalbert
  • 10,067
  • 6
  • 56
  • 64
11
REGEX = /[^\u{1F600}-\u{1F6FF}\s]/

or

REGEX = /[\u{1F600}-\u{1F6FF}\s]/
REGEX = /[\u{1F600}-\u{1F6FF}]/
REGEX = /[^\u{1F600}-\u{1F6FF}]/

because your original regex seems to look for everything that is neither an emoji nor whitespace, and I don't know why you would want to do that.

Also:

  • the emoji are 1F300-1F6FF rather than 1F600-1F6FF; you may want to change that

  • if you want to remove all astral characters (for example you deal with a software that doesn't support all of Unicode), you should use 10000-10FFFF.
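
As a rough sketch of the astral-plane variant (the constant name `ASTRAL` and the sample strings are assumptions made for illustration):

```ruby
# Every code point above U+FFFF (the "astral" planes).
ASTRAL = /[\u{10000}-\u{10FFFF}]/

p "Hi \u{1F600} 中文".gsub(ASTRAL, '')  # "Hi  中文"

# Caveat: rarer CJK ideographs such as U+2000B (CJK Extension B) are astral too,
# so this range removes them along with the emoji.
p "\u{2000B}".gsub(ASTRAL, '')          # ""
```

Common Chinese, Japanese and Korean characters sit in the Basic Multilingual Plane and are unaffected, but the Extension B caveat matters if the input may contain rare ideographs.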

EDIT: You almost certainly want REGEX = /[\u{1F600}-\u{1F6FF}]/ or similar. Your original regex matched everything that is not a whitespace, and not in range 0-\u1F6F. Since spaces are whitespace, and English letters are in range 0-\u1F6F, and Chinese characters are in neither, the regex matched Chinese characters and removed them.

Karol S
  • 9,028
  • 2
  • 32
  • 45
  • Thanks for the reply. I have tried all your regexes in Rubular, and none of them work. This is mine [link](http://rubular.com/r/cqO6RtTHdZ), but it has the problem that I stated in the question... – kilua Jul 11 '14 at 03:37
  • 1
    Your sample list doesn't contain any [emoji](https://en.wikipedia.org/wiki/Emoji), it contains [kaomoji](https://en.wikipedia.org/wiki/Kaomoji). Kaomoji are made from mix of letters and symbols, you can't remove them with a simple regex. – Karol S Jul 11 '14 at 08:37
  • ya my mistake, now I understand how it works... Thanks for your replied – kilua Jul 11 '14 at 09:09
  • Any idea why my regex doesn't compile? I'm doing `ls | perl -e 'print if /[^\u{1F600}-\u{1F6FF}\s]/'` to find filenames containing emoji. – Sridhar Sarnobat Feb 04 '21 at 19:33
  • 1
    @SridharSarnobat Assuming your system's locale is UTF-8, you need to tell Perl to use UTF-8 on standard I/O: `ls | perl -CSD -ne 'print if /[^\u{1F600}-\u{1F6FF}\s]/'` – Karol S Feb 05 '21 at 13:23
  • @KarolS thanks, I need to try this when I get home! – Sridhar Sarnobat Feb 05 '21 at 13:25
2

Instead of removing emoji characters, you can keep only letters and digits. A simple tr should do the trick: .tr('^A-Za-z0-9', ''). Of course this will also remove all punctuation, but you can always modify the character set to suit your specific condition.
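
A short sketch of what this does in practice; the sample string is invented for illustration. Note that an ASCII-only whitelist also drops CJK characters, so it only fits when the input is expected to be plain ASCII:

```ruby
sample = "Hi 中文 \u{1F600} 42!"

# A leading ^ negates the set, and an empty replacement makes tr delete.
p sample.tr('^A-Za-z0-9', '')                # "Hi42"

# Adding the space to the allowed set preserves word boundaries.
p sample.tr('^A-Za-z0-9 ', '').squeeze(' ')  # "Hi 42"
```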

Swaathi Kakarla
  • 2,227
  • 1
  • 19
  • 27
1

This very short Regex covers all Emoji in getemoji.com so far:

[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]
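
A possible way to use these ranges in Ruby, as a sketch only. The constant name `EMOJI_BLOCKS` is made up, and the ranges are written without `|` separators because inside a character class `|` is a literal pipe rather than alternation:

```ruby
EMOJI_BLOCKS = /[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]/

# Emoji from two different blocks are removed; the pipe and the CJK text are kept.
p "a|b \u{1F600}\u{2764} 中文".gsub(EMOJI_BLOCKS, '')  # "a|b  中文"

# With a pipe inside the brackets, the class would also match a literal "|".
p(/[\u{1F300}-\u{1F5FF}|\u{1F1E6}-\u{1F1FF}]/.match?("|"))  # true
```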
Tan Nguyen
  • 3,281
  • 1
  • 18
  • 18
  • Same regexp using `\U` (for Python, Postgres, etc.): `[\U0001F300-\U0001F5FF\U0001F1E6-\U0001F1FF\U00002700-\U000027BF\U0001F900-\U0001F9FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\U00002600-\U000026FF]` – Ilya Semenov Mar 19 '18 at 05:53
1

Careful: the answer from Aray has some side effects.

"-".gsub(/[^\p{L}\s]+/, '').squeeze(' ').strip
=> ""

even though the input is just a simple minus sign (-).

Eric Aya
  • 69,473
  • 35
  • 181
  • 253
0

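I converted the RegEx from the Ruby project above to a JavaScript-style RegEx (shown here as a C# constant). Note that `\uXXXX` escapes take exactly four hex digits, so five-digit escapes such as `\u1F004` are read as `\u1F00` followed by `4`, and the astral ranges need surrogate pairs to match as intended: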

    /// <summary>
    /// Emoji symbols character sets (added \s and +)
    /// Unicode with עברית Delete the emoji to match 
    /// https://regex101.com/r/jP5jC5/3
    /// https://github.com/franklsf95/ruby-emoji-regex
    /// http://stackoverflow.com/questions/24672834/how-do-i-remove-emoji-from-string
    /// </summary>
    public const string Emoji = @"^[\s\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2694\u2696-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\u1F004\u1F0CF\u1F170-\u1F171\u1F17E-\u1F17F\u1F18E\u1F191-\u1F19A\u1F201-\u1F202\u1F21A\u1F22F\u1F232-\u1F23A\u1F250-\u1F251\u1F300-\u1F321\u1F324-\u1F393\u1F396-\u1F397\u1F399-\u1F39B\u1F39E-\u1F3F0\u1F3F3-\u1F3F5\u1F3F7-\u1F4FD\u1F4FF-\u1F53D\u1F549-\u1F54E\u1F550-\u1F567\u1F56F-\u1F570\u1F573-\u1F579\u1F587\u1F58A-\u1F58D\u1F590\u1F595-\u1F596\u1F5A5\u1F5A8\u1F5B1-\u1F5B2\u1F5BC\u1F5C2-\u1F5C4\u1F5D1-\u1F5D3\u1F5DC-\u1F5DE\u1F5E1\u1F5E3\u1F5EF\u1F5F3\u1F5FA-\u1F64F\u1F680-\u1F6C5\u1F6CB-\u1F6D0\u1F6E0-\u1F6E5\u1F6E9\u1F6EB-\u1F6EC\u1F6F0\u1F6F3\u1F910-\u1F918\u1F980-\u1F984\u1F9C0}]+$";

Usage:

if (!Regex.IsMatch(vm.NameFull, RegExKeys.Emoji)) // Match means no Emoji was found
Yovav
  • 2,557
  • 2
  • 32
  • 53
0

In Android | Kotlin you can use this extension function to remove all emojis from String

fun String.removeEmojis(): String = Pattern.compile("[^\\p{L}\\s]+")
    .matcher(this).replaceAll("")

Sample :

val result = "Hi emojis      removed".removeEmojis()
output => "Hi emojis removed"
Majid Arabi
  • 141
  • 1
  • 8
-1
    // Method to remove emoji from a string.
    // Note: it keeps only ASCII characters (U+0000 to U+007F), so CJK and
    // other non-ASCII text is removed as well.
    public static String remove_emoji(String text) {
        StringBuilder updatedText = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String ch = text.substring(i, i + 1);
            // [\x00-\x7F]+ matches ASCII only; a match means this is not an emoji symbol
            if (ch.matches("[\\x00-\\x7F]+")) {
                updatedText.append(ch);
            }
        }
        return updatedText.toString();
    }
  • Providing more information about why this solves the problem can be a great way to improve your answer and help the users with the same problem – iunfixit Aug 31 '21 at 16:44