How do I remove emoji from string

Question

I need to remove emoji from a string using a regex, but keep CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works, except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?

there are a LOT of emoji - maybe it's better to make a blacklist of characters to remove? — dax, Jul 10 '14 at 09:26
@dax mostly those Emojis that are in iPhone and Android Keyboard — kilua, Jul 10 '14 at 09:27

Stefan · Accepted Answer · 2018-06-05T21:09:01.080

37

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/
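
As a minimal sketch of the difference, the snippet below first shows how Ruby parses the brace-less escape and then applies the curly-brace version. The constant `EMOJI_BLOCKS`, the method `strip_emoji_blocks`, and the sample text are hypothetical names chosen for illustration; they do not come from the question.

```ruby
# Without braces, \u takes exactly four hex digits, so "\u1F600"
# is U+1F60 followed by the ASCII digit "0".
broken = "\u1F600"
p broken.length                             # => 2
p broken.codepoints.map { |c| c.to_s(16) }  # => ["1f60", "30"]

# With braces, the full five-digit code points are used.
EMOJI_BLOCKS = /[\u{1F600}-\u{1F6FF}]/

# Hypothetical helper: remove every character in U+1F600..U+1F6FF.
def strip_emoji_blocks(text)
  text.gsub(EMOJI_BLOCKS, '')
end

# CJK characters and ASCII are outside the range and stay untouched.
p strip_emoji_blocks("华み원\u{1F600} ok")    # => "华み원 ok"
```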

This matches (emoji) characters in these unicode blocks:

U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols

You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7 which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!"
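
The same idea can probably be extended to several ranges at once. The sketch below is written for a current Ruby (it uses `\u{...}` string escapes in the sample call); `EMOJI_RANGES` and `drop_code_points` are hypothetical names, and the list of ranges is only an assumption about what you want to drop.

```ruby
# Hypothetical list of code point ranges to drop; extend as needed.
EMOJI_RANGES = [
  0x1F300..0x1F5FF, # Miscellaneous Symbols and Pictographs
  0x1F600..0x1F6FF  # Emoticons through Transport and Map Symbols
].freeze

# Decode to code points, reject those inside any range, re-encode as UTF-8.
def drop_code_points(str, ranges = EMOJI_RANGES)
  kept = str.unpack('U*').reject do |cp|
    ranges.any? { |range| range.include?(cp) }
  end
  kept.pack('U*')
end

p drop_code_points("Hi!\u{1F600} 华\u{1F3E0}")  # => "Hi! 华"
```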

Regarding your Rubular example – Emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).

edited Jun 05 '18 at 21:09

answered Jul 10 '14 at 10:31

I have tried the regex that you provided. and this is the [link](http://rubular.com/r/C7HpUmiJjI) and this is using my regex that I mentioned in the question [link](http://rubular.com/r/cqO6RtTHdZ). It does not work with your regex, mine is working but has problem like I mentioned in the question. – kilua Jul 11 '14 at 03:19
@kilua those are not [emoji](http://en.wikipedia.org/wiki/Emoji) but [kaomoji](http://en.wikipedia.org/wiki/Kaomoji) – Stefan Jul 11 '14 at 05:14
Oh I see, let's just focus on emoji then. I am building this to prevent users from submitting emoji from iOS/Android to our server. I know they can disable it on the phone keyboard, but I still need to filter it out on the server side. Your regex works fine for emoticons, but it does not work for other emoji like the "house", "town", "animals", etc. I have removed the kaomoji – kilua Jul 11 '14 at 06:56
You have to add the appropriate character ranges, e.g. to include [Miscellaneous Symbols And Pictographs](http://en.wikipedia.org/wiki/Miscellaneous_Symbols_and_Pictographs) start with U+1F300: `/[\u{1F300}-\u{1F6FF}]/`. Karol S already mentioned that. – Stefan Jul 11 '14 at 07:09
Yes, correct. I am using that one right now, but it doesn't seem to cover all the emoji or the miscellaneous symbols. I am on Mac OS X, where ctrl + cmd + space opens the emoji picker. Not all of them are covered by `/[\u{1F300}-\u{1F6FF}]/`. Any help? Thank you very much – kilua Jul 11 '14 at 07:40
@kilua I've posted a [follow-up question](http://stackoverflow.com/q/24695159/477037) – Stefan Jul 11 '14 at 10:03
Are you sure about `"^_^".length #=> 1`? – sawa Jul 11 '14 at 10:45
@Stefan this is a great start but doesn't match all of "" ... I found this solution in Ruby which seems to work well: http://stackoverflow.com/questions/16487697/how-to-remove-4-byte-utf-8-characters-in-ruby – steve Apr 14 '16 at 20:51
Stefan your edition didn't work for ruby 1.8.7, the between is not taking the number inside the hex range, so it returns false, and then the emoji is not rejected – G. I. Joe Jun 05 '18 at 15:53
@G.I.Joe it does work, `0x1F600` is just another way of writing `128512`. – Stefan Jun 05 '18 at 16:14
no it doesn't, I ran it even your range is repeating the limits – G. I. Joe Jun 05 '18 at 20:32
@G.I.Joe sorry, there was a typo, the upper limit of course has to be `0x1F6FF` – Stefan Jun 05 '18 at 21:09

score 24 · Answer 2 · answered Oct 29 '15 at 07:14

24

I am using one based on this script.

def strip_emoji(text)
  # Work on a UTF-8 copy so the caller's string is not modified.
  clean = text.dup.force_encoding('utf-8').encode

  # symbols & pics
  clean = clean.gsub(/[\u{1f300}-\u{1f5ff}]/, "")

  # enclosed chars (I changed the upper bound to exclude Chinese characters)
  clean = clean.gsub(/[\u{2500}-\u{2BEF}]/, "")

  # emoticons
  clean = clean.gsub(/[\u{1f600}-\u{1f64f}]/, "")

  # dingbats (already inside the 2500-2BEF range above)
  clean.gsub(/[\u{2702}-\u{27b0}]/, "")
end

Results:

irb> strip_emoji("☂❤华み원❤")
=> "华み원"

answered Oct 29 '15 at 07:14

jellene

great answer.. saves my day.. !! :) – Vishal Nov 10 '16 at 11:31
This worked well for me. I created a EmojiStripper concern that uses a before_validation callback to strip emojis from all string fields before validation is executed. That results in all emojis being stripped before it is saved to the DB. – curtp Dec 27 '16 at 21:29
WARNING: THE CODE IN THIS ANSWER WILL NOT REMOVE ALL EMOJIS. It removes simple emojis fine, but it does not fully remove multi code points emojis correctly, such as ‍‍‍or ☸️. – Jerome Dalbert Oct 04 '18 at 17:17

franklsf95 · Answer 3 · 2015-03-31T20:52:40.133

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:

[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]

I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.

Example usage:

regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji 😀 and other Unicode characters 比如中文."
str.gsub regex, ''
# => "I am a string with emoji  and other Unicode characters 比如中文."

Other Unicode characters, such as Asian characters, are preserved.

EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments for details.

Huh, I pasted this into rubular and found that it matched numbers, too — rmosolgo, Mar 31 '15 at 16:12
@rmosolgo Thanks for catching this! I have excluded numbers and other ASCII characters from the emoji. The reason numbers were included is that some emoji are of the form 8⃣ (`U+0038 U+20E3`). I manually removed those ASCII codes. — franklsf95, Mar 31 '15 at 20:51

Jerome Dalbert · Answer 4 · 2021-07-01T23:08:43.610

Most of the answers in this thread don't remove all emojis correctly. They remove simple emojis fine. But they won't fully remove multi code point emojis, such as zero-width-joiner sequences or ☸️ (a base character followed by a variation selector), leaving some residual Unicode code points behind.
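
To make the residue visible, here is a small sketch. `RANGES_ONLY` is a hypothetical pattern in the style of the range-based answers, and `WITH_JOINERS` is one possible extension that assumes it is acceptable to drop U+200D (zero-width joiner) and U+FE0F (variation selector-16) everywhere; U+200D is also used in some non-emoji text, so treat that as an assumption rather than a recommendation.

```ruby
# A range-only pattern, similar to those used elsewhere in this thread.
RANGES_ONLY = /[\u{2500}-\u{2BEF}\u{1F300}-\u{1F6FF}]/

wheel  = "\u{2638}\u{FE0F}"            # U+2638 plus variation selector-16
joined = "\u{1F468}\u{200D}\u{1F469}"  # two emoji joined by a zero-width joiner

# The visible symbols are removed, but the invisible code points remain.
p wheel.gsub(RANGES_ONLY, '').codepoints.map { |c| c.to_s(16) }   # => ["fe0f"]
p joined.gsub(RANGES_ONLY, '').codepoints.map { |c| c.to_s(16) }  # => ["200d"]

# Assumption: dropping U+200D and U+FE0F as well is fine for your input.
WITH_JOINERS = /[\u{2500}-\u{2BEF}\u{1F300}-\u{1F6FF}\u{200D}\u{FE0F}]/

p wheel.gsub(WITH_JOINERS, '')   # => ""
p joined.gsub(WITH_JOINERS, '')  # => ""
```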

You could use a gem like unicode-emoji to get the latest emoji regexes, but if you find this overkill the following code might be a good enough solution:

text.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip

This will remove any emoji or weird-unicody-like character that is not a basic alphanum/punct/blank.

Karol S · Answer 5 · 2014-07-10T10:18:39.893

11

REGEX = /[^\u{1F600}-\u{1F6FF}\s]/

or

REGEX = /[\u{1F600}-\u{1F6FF}\s]/
REGEX = /[\u{1F600}-\u{1F6FF}]/
REGEX = /[^\u{1F600}-\u{1F6FF}]/

because your original regex seems to indicate that you are trying to find everything that is neither an emoji nor whitespace, and I don't know why you would want to do that.

Also:

the emoji are 1F300-1F6FF rather than 1F600-1F6FF; you may want to change that
if you want to remove all astral characters (for example, because you are dealing with software that doesn't support all of Unicode), you should use 10000-10FFFF.
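
A rough sketch of the astral-plane variant follows. `ASTRAL` and `strip_astral` are hypothetical names. One caveat worth knowing for the CJK requirement in the question: some rarer CJK ideographs (for example U+20000, in CJK Unified Ideographs Extension B) are themselves astral, so this pattern removes them too.

```ruby
# Everything above the Basic Multilingual Plane.
ASTRAL = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: drop all astral characters.
def strip_astral(text)
  text.gsub(ASTRAL, '')
end

# The emoji and the Extension B ideograph go; the BMP character 华 stays.
p strip_astral("emoji \u{1F600}, BMP 华, astral CJK \u{20000}")
# => "emoji , BMP 华, astral CJK "
```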

EDIT: You almost certainly want REGEX = /[\u{1F600}-\u{1F6FF}]/ or similar. Your original regex matched everything that is not a whitespace, and not in range 0-\u1F6F. Since spaces are whitespace, and English letters are in range 0-\u1F6F, and Chinese characters are in neither, the regex matched Chinese characters and removed them.

edited Jul 10 '14 at 10:18

answered Jul 10 '14 at 09:45

Thanks for replied, I have tried all your regex in rubular, none of them are working. This is mine [link](http://rubular.com/r/cqO6RtTHdZ) but it has problem that I stated in question... – kilua Jul 11 '14 at 03:37
Your sample list doesn't contain any [emoji](https://en.wikipedia.org/wiki/Emoji), it contains [kaomoji](https://en.wikipedia.org/wiki/Kaomoji). Kaomoji are made from mix of letters and symbols, you can't remove them with a simple regex. – Karol S Jul 11 '14 at 08:37
ya my mistake, now I understand how it works... Thanks for your replied – kilua Jul 11 '14 at 09:09
Any idea why my regex doesn't compile? I'm doing `ls | perl -e 'print if /[^\u{1F600}-\u{1F6FF}\s]/'` to find filenames containing emoji. – Sridhar Sarnobat Feb 04 '21 at 19:33
@SridharSarnobat Assuming your system's locale is UTF-8, you need to tell Perl to use UTF-8 on standard I/O: `ls | perl -CSD -ne 'print if /[^\u{1F600}-\u{1F6FF}\s]/'` – Karol S Feb 05 '21 at 13:23
@KarolS thanks, I need to try this when I get home! – Sridhar Sarnobat Feb 05 '21 at 13:25

score 2 · Answer 6 · answered Aug 28 '15 at 06:39

2

Instead of removing emoji characters, you can keep only letters and digits. A simple tr should do the trick: .tr('^A-Za-z0-9', ''). Of course this also removes all punctuation, whitespace and non-ASCII letters (including CJK), but you can always modify the character set to suit your specific condition.
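
As an illustration of the allowlist idea, the sketch below uses `String#delete` with a negated character set, which should be another way to express the same thing. The method `keep_ascii_alnum` and its `extra` parameter are hypothetical; `extra` is interpolated into the set as-is, so characters that are special in such sets (`-`, `^`, `\`) would need care.

```ruby
# Hypothetical helper: keep ASCII letters and digits, plus any extra
# characters the caller allows (for example a space).
def keep_ascii_alnum(text, extra = '')
  text.delete("^A-Za-z0-9#{extra}")
end

text = "Hi \u{1F600} 华 42!"

p keep_ascii_alnum(text)                    # => "Hi42"
p keep_ascii_alnum(text, ' ').squeeze(' ')  # => "Hi 42"
```

Note that 华 is removed in both calls, so this approach does not meet the question's requirement of keeping CJK characters unless the set is widened.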

answered Aug 28 '15 at 06:39

Swaathi Kakarla

Tan Nguyen · Answer 7 · 2018-01-11T04:01:25.823

1

This very short Regex covers all Emoji in getemoji.com so far:

[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]
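
A short usage sketch in Ruby follows. Inside a character class, `|` is a literal pipe rather than alternation, so the ranges here are simply listed back to back; otherwise a `|` in the input would be removed as well. `SHORT_EMOJI` and `strip_short_emoji` are hypothetical names.

```ruby
# The ranges from this answer, written as one character class.
SHORT_EMOJI = /[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]/

# Hypothetical helper: remove every character in those ranges.
def strip_short_emoji(text)
  text.gsub(SHORT_EMOJI, '')
end

# U+1F600 and U+2602 are removed; the pipe and the CJK character stay.
p strip_short_emoji("a|b \u{1F600}\u{2602} 华")  # => "a|b  华"
```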

edited Jan 11 '18 at 04:01

answered Jan 10 '18 at 10:43

Same regexp using `\U` (for Python, Postgres, etc.): `[\U0001F300-\U0001F5FF|\U0001F1E6-\U0001F1FF|\U00002700-\U000027BF|\U0001F900-\U0001F9FF|\U0001F600-\U0001F64F|\U0001F680-\U0001F6FF|\U00002600-\U000026FF]` – Ilya Semenov Mar 19 '18 at 05:53

score 1 · Answer 8 · edited Jul 17 '18 at 13:43

1

Careful: the answer from Aray has some side effects.

"-".gsub(/[^\p{L}\s]+/, '').squeeze(' ').strip
=> ""

even though this is supposed to be a simple minus (-).
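
One possible way around this, sketched below, is to widen the allowlist with the Unicode number and punctuation categories. `LETTERS_ONLY` mirrors the pattern criticized above; `WIDER` is a hypothetical alternative, not something taken from the other answer. Symbols outside those categories, such as `+` or `$`, would still be removed.

```ruby
# The pattern discussed above: everything except letters and whitespace.
LETTERS_ONLY = /[^\p{L}\s]+/

# Hypothetical wider allowlist: also keep numbers (\p{N}) and punctuation (\p{P}).
WIDER = /[^\p{L}\p{N}\p{P}\s]+/

text = "3-5 \u{1F600} 华!"

p text.gsub(LETTERS_ONLY, '')  # => "  华"
p text.gsub(WIDER, '')         # => "3-5  华!"
p "-".gsub(WIDER, '')          # => "-"
```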

edited Jul 17 '18 at 13:43

Eric Aya

answered Jul 17 '18 at 13:14

Filipe Santiago

score 0 · Answer 9 · answered Aug 19 '15 at 11:25

I converted the RegEx from the Ruby project above to a C#-friendly RegEx:

    /// <summary>
    /// Emoji symbols character sets (added \s and +)
    /// Unicode with עברית Delete the emoji to match 
    /// https://regex101.com/r/jP5jC5/3
    /// https://github.com/franklsf95/ruby-emoji-regex
    /// http://stackoverflow.com/questions/24672834/how-do-i-remove-emoji-from-string
    /// </summary>
    public const string Emoji = @"^[\s\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2694\u2696-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\u1F004\u1F0CF\u1F170-\u1F171\u1F17E-\u1F17F\u1F18E\u1F191-\u1F19A\u1F201-\u1F202\u1F21A\u1F22F\u1F232-\u1F23A\u1F250-\u1F251\u1F300-\u1F321\u1F324-\u1F393\u1F396-\u1F397\u1F399-\u1F39B\u1F39E-\u1F3F0\u1F3F3-\u1F3F5\u1F3F7-\u1F4FD\u1F4FF-\u1F53D\u1F549-\u1F54E\u1F550-\u1F567\u1F56F-\u1F570\u1F573-\u1F579\u1F587\u1F58A-\u1F58D\u1F590\u1F595-\u1F596\u1F5A5\u1F5A8\u1F5B1-\u1F5B2\u1F5BC\u1F5C2-\u1F5C4\u1F5D1-\u1F5D3\u1F5DC-\u1F5DE\u1F5E1\u1F5E3\u1F5EF\u1F5F3\u1F5FA-\u1F64F\u1F680-\u1F6C5\u1F6CB-\u1F6D0\u1F6E0-\u1F6E5\u1F6E9\u1F6EB-\u1F6EC\u1F6F0\u1F6F3\u1F910-\u1F918\u1F980-\u1F984\u1F9C0}]+$";

Usage:

if (!Regex.IsMatch(vm.NameFull, RegExKeys.Emoji)) // Match means no Emoji was found

score 0 · Answer 10 · answered Jul 15 '22 at 00:26

In Android | Kotlin you can use this extension function to remove all emojis from String

fun String.removeEmojis(): String = Pattern.compile("[^\\p{L}\\s]+")
    .matcher(this).replaceAll("")

Sample :

val result = "Hi emojis      removed".removeEmojis()
output => "Hi emojis removed"

dipti joshi · Answer 11 · 2021-09-08T13:26:09.023

-1

    // Method to remove emoji from a string
    public static String remove_emoji(String text) {
        String updated_text = "";
        for (int i = 0; i < text.length(); i++) {
            // The regex [\\x00-\\x7F]+ matches ASCII characters only; a match
            // means the character is not an emoji symbol, so it is kept.
            if (text.substring(i, i + 1).matches("[\\x00-\\x7F]+")) {
                updated_text = updated_text + text.substring(i, i + 1);
            }
        }
        return updated_text;
    }

edited Sep 08 '21 at 13:26

answered Aug 31 '21 at 15:35

Providing more information about why this solves the problem can be a great way to improve your answer and help the users with the same problem – iunfixit Aug 31 '21 at 16:44
