92

I found this code in Python for removing emojis, but it is not working. Can you help with other code or a fix for this one?

I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here's the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list is a word, for example: ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

UPDATE: I used this other code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emojis; they are still shown. Any clue why that is?

Mona Jalal
  • 3
    Emoji characters are not restricted to a single range (see [this](http://www.unicode.org/Public/emoji/1.0/emoji-data.txt) list of characters). – 一二三 Oct 29 '15 at 02:36
  • 1
    Your emojis don't start with `\xf`. You're probably seeing the bytes representing that string in UTF-8, and the first byte is `0xf0`. – roeland Oct 29 '15 at 03:57
  • 1
    related: [remove unicode emoji using re in python](http://stackoverflow.com/q/26568722/4279) – jfs Oct 29 '15 at 14:46
  • Please check: https://stackoverflow.com/questions/52464119/removing-emoji-from-text-remove-also-japanese-langauge/52464600#52464600 For a bug in the chosen answer. – Sion C Sep 23 '18 at 09:17
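
roeland's point about `0xf0` can be checked directly in Python 3. This is only a small sketch (the variable names are illustrative, not from the question): encoding the emoji from the question's list to UTF-8 gives the four bytes shown there, and the first of them is the byte `0xf0`, not a character called `\xf`.

```python
# U+1F602 is the character that shows up as '\xf0\x9f\x98\x82' in the question's list.
emoji_char = "\U0001F602"

# Encoding to UTF-8 yields four bytes; the first one is 0xf0.
utf8_bytes = emoji_char.encode("utf-8")
first_byte = utf8_bytes[0]

# Decoding those bytes gives back a single Unicode character.
round_trip = utf8_bytes.decode("utf-8")

print(utf8_bytes, hex(first_byte), len(round_trip))
```

So a check like `startswith` only makes sense on the raw bytes, and the emoji ranges in a regex only make sense on the decoded text.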

27 Answers

82

On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog 

Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.

jfs
  • so if you have the flag you don't need to decode you data using `.decode('utf-8') ? – Mona Jalal Oct 29 '15 at 16:58
  • 1
    @MonaJalal: no, `.decode()` converts a bytestring into Unicode string. You should prefer Unicode strings while working with text (`type(text) == unicode` on Python 2) – jfs Oct 29 '15 at 17:03
  • Hey this worked except now all my words end up having a u in the beginning. you know how can that be removed? `[u'This', u'dog', u'', u'https://t.co/5N86jYipOI']` – Mona Jalal Oct 29 '15 at 19:47
  • on a side question, the answer you linked me to, do I need to have a try except phrase or would this do the job? http://stackoverflow.com/a/26568779/2414957 – Mona Jalal Oct 29 '15 at 19:50
  • 1
    @MonaJalal: Linux uses a wide python2 build by default and therefore the code in the answer should work as is there. You might need the try/except only on a narrow python2 build e.g., on Windows (You could update to Python 3, to avoid thinking about narrow/wide builds -- the code in the answer works on Python 3 too). – jfs Oct 29 '15 at 19:55
  • This is my python version `Python 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2] on linux2` hope I might be fine – Mona Jalal Oct 29 '15 at 19:57
  • btw while the code removes some of the emojis, it doesn't remove all. Like here it didn't remove some: http://pastebin.com/tAETGfWz – Mona Jalal Oct 29 '15 at 20:10
  • 1
    @MonaJalal: edit your question and put the necessary info there. Try to limit your questions to a single issue so that the question might be useful to somebody else too (you had *"sre_constants.error: bad character range"* issue that is explained in [Bryan Oakley's answer](http://stackoverflow.com/a/33404838/4279), I've shown how to properly write `emoji_pattern` without `ur" " " ...\\` (I haven't try to find [valid emoji ranges](http://goo.gl/8DuLMK)). Unfortunately, neither directly answer the question in the title of your question. Also, [don't encode to bytes](http://goo.gl/1J1LPa) – jfs Oct 29 '15 at 20:31
  • Sure, thanks a lot for guide. I will create another question with the solution you provided and will put the link here. – Mona Jalal Oct 29 '15 at 20:35
  • I do not wanted to remove emoji from my code i just wanted to transform it into unicode , anyone have any clue ? – Shubham Sharma Sep 15 '17 at 12:24
  • What is your input (a plain text file (what is character encoding), a json document you've downloaded from an http server, from a db, a bytestring in utf-8, etc). What you get (what is the result of your current code) and what do you want to get instead (the desirable output). Create a [minimal but complete code example](https://stackoverflow.com/help/mcve) that demonstrates the problem and post all this info as a separate Stack Overflow question. – jfs Sep 15 '17 at 12:31
  • 3
    It didnt work on `เบอร์10!! ส้มสวย 01แฝดของ08 พร้อมส่ง!` string which is `\xF0\x9F\x92\x8B\xF0\x9F` – Umair Ayub Oct 10 '17 at 10:10
  • @Umair: make sure, you've read [my comment above](https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python/33417311#comment54637275_33417311). – jfs Oct 10 '17 at 21:27
  • @jfs, Hello, Can you please help me in removing this ` `. I'm getting this error `unacceptable character #x1f914:` – shaik moeed Jan 12 '19 at 10:12
  • so how do we remove emojis such as the one in Trump is throwing out names of people and what they've done for him and they coming back saying it's a lie. \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 ? @jfs – Mona Jalal Mar 24 '19 at 23:54
  • @MonaJalal: You have UTF-8 encoded bytes as a `str`. You need to decode it to true Unicode text (the Py2 `unicode` type) for this to work, e.g. `encodedtext = '\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82'`, `text = encodedtext.decode('utf-8')`. Then this will work using `text`, rather than the encoded form. – ShadowRanger Sep 13 '19 at 16:04
  • Alternately you could simplify and use re.sub ()```result = re.sub('[(\U0001F600-\U0001F64F|\U0001F300-\U0001F5FF|\U0001F680-\U0001F6FF|\U0001F1E0-\U0001F1FF|\U0001F90C-\U0001F9FF)]+','','A quick brown fox jumps over the lazy dog')``` – Nimin Unnikrishnan Apr 24 '20 at 06:08
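
For readers on Python 3, here is a rough sketch of the decode-then-substitute flow discussed in the answer and comments above. The function name `strip_emoji_from_bytes` is made up for illustration, and the pattern uses the same four ranges as the answer, so it still matches only some emoji.

```python
import re

# Same four ranges as the answer above; this matches only some emoji, not all.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def strip_emoji_from_bytes(raw, encoding="utf-8"):
    """Decode a byte string to text, then remove characters in the ranges above."""
    text = raw.decode(encoding)
    return EMOJI_PATTERN.sub("", text)


# The list from the question, written as Python 3 byte strings.
words = [b"This", b"dog", b"\xf0\x9f\x98\x82", b"https://t.co/5N86jYipOI"]
cleaned = [strip_emoji_from_bytes(w) for w in words]
print(cleaned)
```

The emoji entry becomes an empty string, which matches what Mona reported in the comments (minus the `u` prefix, which Python 3 does not print).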
68

Complete version for removing emojis:

import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U00002702-\U000027B0"  # duplicate of the line above (harmless)
        u"\U000024C2-\U0001F251"  # very broad: also spans CJK, Hangul and other scripts
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # every code point above U+FFFF
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return emoj.sub('', data)
Karim Omaya
  • It works well, thank you. But for me it didn't remove this icon: ⏪. – Abdel Oct 21 '21 at 10:55
  • 1
    this removes some arabic letters. Thus messing up Arabic text. Please advise – R.A Dec 05 '21 at 12:30
  • 4
    this works, but: `u"\U00002702-\U000027B0"` is duplicated, `u"\U000024C2-\U0001F251"` already includes ranges `u"\U00002500-\U00002BEF"` and `u"\U00002702-\U000027B0"`. Also `u"\U00010000-\U0010ffff"` already includes everything with 5+ digits before it and `u"\u2600-\u2B55"` already includes `u"\u2640-\u2642"`. So this answer could be shorter and more concise. – lateus Dec 14 '21 at 20:28
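
To make lateus's observation concrete, here is a small sketch that merges the distinct ranges listed in the answer into their non-overlapping form. The helper `merge_ranges` is hypothetical, written only for this illustration.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) code point ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The distinct ranges and single characters listed in the answer above.
answer_ranges = [
    (0x1F600, 0x1F64F), (0x1F300, 0x1F5FF), (0x1F680, 0x1F6FF),
    (0x1F1E0, 0x1F1FF), (0x2500, 0x2BEF), (0x2702, 0x27B0),
    (0x24C2, 0x1F251), (0x1F926, 0x1F937), (0x10000, 0x10FFFF),
    (0x2640, 0x2642), (0x2600, 0x2B55), (0x200D, 0x200D),
    (0x23CF, 0x23CF), (0x23E9, 0x23E9), (0x231A, 0x231A),
    (0xFE0F, 0xFE0F), (0x3030, 0x3030),
]

merged = merge_ranges(answer_ranges)
for start, end in merged:
    print("U+%04X-U+%04X" % (start, end))
```

The result is five ranges, the last one running from U+24C2 all the way to U+10FFFF. That single range contains CJK ideographs (U+4E00 to U+9FFF) and the Arabic presentation forms (U+FB50 to U+FDFF), which may be what R.A ran into, while ⏪ (U+23EA) falls into none of the five, which fits Abdel's comment.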
55

I am updating my answer to match the one by @jfs, because my previous answer did not account for non-ASCII text in other scripts such as Latin, Greek, etc. Stack Overflow doesn't allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.

#!/usr/bin/env python
import re

text = u'This is a smiley face \U0001f602'
print(text) # with emoji

def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

print(deEmojify(text))

This was my previous answer, do not use this.

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
Abdul-Razak Adam
  • 37
    This strips all non-ASCII characters, and does so **very inefficiently** (why not just `inputString.encode('ascii', 'ignore').decode('ascii')` and be done with it in one single step?) . There is more to the larger Unicode standard than just Emoji, you can't just strip Latin, Greek, Hangul, Myanmar, Tibetan, Egyptian or [any of the other Unicode-supported scripts](https://en.wikipedia.org/wiki/Script_(Unicode)#List_of_scripts_in_Unicode) just to remove the Emoji. – Martijn Pieters Aug 10 '18 at 11:10
  • this is the only solution that worked for text = 'This dog \xe2\x80\x9d \xe2\x80\x9c' – Mona Jalal Mar 25 '19 at 00:15
  • 2
    @MonaJalal: That string isn't actually Unicode (it's the raw bytes representing the UTF-8 encoding of actual Unicode). Even decoded, it has no emoji at all (those bytes decode to right and left "smart quotes"). If this solves your problem, your problem wasn't what your question was asking about; this removes all non-ASCII characters (including simple stuff like accented e, `é`), not just emoji. – ShadowRanger Sep 13 '19 at 15:56
  • This removes other language characters apart from emoji. Is there any other way to remove only the emojis? @MartijnPieters – Ishara Malaviarachchi Nov 27 '19 at 13:45
  • 2
    @IsharaMalaviarachchi: I wrote an answer to a different question that removes emoji: [Remove Emoji's from multilingual Unicode text](//stackoverflow.com/a/51785357) – Martijn Pieters Nov 27 '19 at 16:55
  • This does not eliminate the emojis entered through the desktop or google chrome. It eliminates emojis entered through the mobile interface. By Pressing windows + full stop we can enter emojis which I'm unable to eliminate – shreesh katti Jan 22 '20 at 11:20
  • this is remove other language chars – Ali Mohammadi Feb 21 '20 at 22:13
  • Hi, used that to remove emojis but as I working on Turkish, it also removed also Turkish characters such as ş,ı,ğ,ç,ö,ü is there any way to omit those? – Berkehan May 27 '20 at 12:05
  • Very nice solution @Mona!! Using @Martijns formula flagged all of the Asian characters in my news feeds, which was not the goal. – jsfa11 Jan 23 '21 at 23:13
  • how can i remove emojis code like :), :) :D – Sunil Garg Feb 25 '21 at 12:08
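
The objection raised in the comments above is easy to reproduce. A minimal Python 3 sketch (the function name `strip_non_ascii` and the sample strings are only illustrative) shows that the ASCII round trip removes the emoji, but also the accented letter and all six Turkish letters mentioned by Berkehan:

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


samples = [
    "This dog \U0001F602",                      # emoji
    "caf\u00e9",                                # accented e
    "\u015f\u0131\u011f\u00e7\u00f6\u00fc",     # Turkish letters from the comment above
]
for sample in samples:
    print(ascii(sample), "->", repr(strip_non_ascii(sample)))
```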
25

If you are not keen on using regex, the best solution could be using the emoji python package.

Here is a simple function to return emoji free text (thanks to this SO answer):

import emoji
def give_emoji_free_text(text):
    # Python 2: decode the UTF-8 byte string once and reuse the result
    decoded = text.decode('utf-8')
    emoji_list = [c for c in decoded if c in emoji.UNICODE_EMOJI]
    # Drop every whitespace-separated word that contains an emoji
    clean_text = ' '.join([word for word in decoded.split() if not any(e in word for e in emoji_list)])
    return clean_text

If you are dealing with strings containing emojis, this is straightforward

>> s1 = "Hi  How is your  and . Have a nice weekend "
>> print s1
Hi  How is your  and . Have a nice weekend 
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend

If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.

>> s2 = u'This dog \U0001f602'
>> print s2
This dog 
>> print give_emoji_free_text(s2.encode('utf8'))
This dog

Edits

Based on the comment, it should be as easy as:

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
kingmakerking
  • 12
    The project does one better: it *includes a regex generator function*. Use `emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))` and be done with it. Do not just iterate over all the characters one by one, that's.. very inefficient. – Martijn Pieters Aug 10 '18 at 11:25
  • This doesn't work with `♕ ♔NAFSET ♕`. May be those characters arenot emojies. – heyxh Jan 08 '20 at 12:03
  • 11
    The code in Edits will throw an error if the `text` is already decoded. In that case, the return statement should be `return emoji.get_emoji_regexp().sub(r'', text)` where we drop the unnecessary `.decode('utf8')` – Pedram Mar 14 '20 at 06:52
  • 3
    `emoji` package has internal function dedicated to emoji replacement - `emoji.replace_emoji(str, replace='')` – Ernest Jun 01 '22 at 13:10
19

Complete version to remove emojis:

import re
def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
Zoraiyo
Ali Tavakoli
  • can you explain more specifically, what additional you give (by adding comments like other parts) – malioboro Jul 26 '18 at 06:52
  • 1
    It's *not* a perfect solution, because the Unicode 9.0 emoji are not included in the pattern. Nor are those for Unicode 10.0 or 11.0. You'll just have to keep updating the pattern. – Martijn Pieters Aug 10 '18 at 11:06
  • @MartijnPieters see my answer below! – KevinTydlacka Aug 21 '18 at 19:16
  • @KevinTydlacka: that's not a good approach either. See [my a recent answer of mine](https://stackoverflow.com/questions/51784964/remove-emojis-from-multilingual-unicode-text/51785357#51785357) that relies on a 3rd-party library to provide updated regexes. – Martijn Pieters Aug 24 '18 at 19:15
18

If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:

emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)
Community
scwagner
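
For anyone wondering where values like \ud83d and \ude00 in the narrow-build regex above come from: on a narrow build a character above U+FFFF is stored as a UTF-16 surrogate pair. The sketch below (Python 3, with a made-up helper name `to_surrogate_pair`) does the standard UTF-16 arithmetic.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F602)
print("U+1F602 -> \\u%04x \\u%04x" % (high, low))
```

The emoticon range U+1F600 to U+1F64F maps to the high surrogate \ud83d followed by a low surrogate between \ude00 and \ude4f, which is the first alternative in the regex.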
16

The accepted answer and others worked for me for a while, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane (BMP). This also excludes future additions to the other Unicode planes (where emojis and such live), which means I don't have to update my code every time new Unicode characters are added :).

In Python 2.7, convert your text to unicode if it is not already, and then use the negated regex below (it substitutes anything not in the character class, which lists all BMP characters except the surrogates, which are used in pairs to represent characters in the supplementary planes).

NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
KevinTydlacka
  • Thank you for sharing. The ranges above do not filter characters like this one: I don't even know what this is because I cannot see it in my browser, hope it is not something insulting :D – Teddy Markov May 15 '17 at 19:01
  • This is the most robust answer. For Python 3, the last line becomes `cleaned_text = NON_BMP_RE.sub(u"", text)`. – pir Jun 04 '21 at 17:59
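
A Python 3 variant of the same idea could look like the sketch below. It is not the answer's exact regex: instead of negating the BMP, it matches the supplementary planes directly, so unlike the original it leaves lone surrogates alone. The name `strip_non_bmp` is only illustrative.

```python
import re

# Any code point above U+FFFF lies outside the Basic Multilingual Plane.
NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text):
    """Remove every character outside the BMP from a Python 3 str."""
    return NON_BMP_RE.sub("", text)


print(ascii(strip_non_bmp("This dog \U0001F602")))  # the emoji is above U+FFFF
print(ascii(strip_non_bmp("star \u2b50")))           # U+2B50 is inside the BMP
```

Note the trade-off: emoji that live inside the BMP, such as U+2B50 in the second sample, are not removed by this approach.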
9

I tried to collect the complete list of Unicode ranges. I use it to extract emojis from tweets and it works very well for me.

# Emojis pattern
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)
Chiheb.K
9

I was able to get rid of the emoji in the following ways.

Install the emoji package (https://pypi.org/project/emoji/):

$ pip3 install emoji

import emoji

def remove_emoji(string):
    return emoji.get_emoji_regexp().sub(u'', string)

emojis = '(`ヘ´) ⭕⭐⏩'
print(remove_emoji(emojis))

## Output result
(`ヘ´)
jojo
8

The best solution to this is to use the external library emoji. This library is continuously updated with the latest emojis and thus can be used to find them in any text. Unlike the ASCII encode/decode method, which removes all non-ASCII characters, this method keeps them and removes only emojis.

  1. First, install the emoji library if you don't have it: pip install emoji
  2. Next import it in your file/project : import emoji
  3. Now to remove all emojis use the statement: emoji.get_emoji_regexp().sub("", msg) where msg is the text to be edited

That's all you need.

Asclepius
kshubham506
8

Use the Demoji package, https://pypi.org/project/demoji/

import demoji

text=""
emoji_less_text = demoji.replace(text, "")
user9225268
6

I found two libs to replace emojis:

Emoji: https://pypi.org/project/emoji/

import emoji
string = "  "
emoji.replace_emoji(string, replace="!")

Demoji: https://pypi.org/project/demoji/

import demoji
string = "  "
demoji.replace(string, repl="!")

Both of them have other useful methods.

Falko
Heloisa Rocha
4

This is my solution. It also removes the additional man and woman emoji that can't be rendered by Python: ‍♂ and ‍♀

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       u"\U0001f926-\U0001f937"
                       u"\u200d"
                       u"\u2640-\u2642" 
                       "]+", flags=re.UNICODE)
otto
4

This is the easiest code for removing all emoji.

import emoji

def remove_emojis(text: str) -> str:
    return ''.join(c for c in text if c not in emoji.UNICODE_EMOJI)

pip install emoji

Nori
3

Because [...] means any one of a set of characters, and because two characters in a group separated by a dash mean a range of characters (often, "a-z" or "0-9"), your pattern says "a slash, followed by any one of the characters in the group containing x, {, 1, F, 6, 0, 1, the range } through x, {, 1, F, 6, 4, F or }, followed by a slash and the letter u". That range in the middle is what re is calling the bad character range.
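
The same failure can be reproduced on Python 3 with a short sketch (the helper `try_compile` is hypothetical and exists only for this illustration). The `}` character sorts after `x`, so `}-x` is a reversed range and `re` refuses it, whereas a range between two real code points compiles.

```python
import re


def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


# The pattern from the question: '}-x' is a reversed, hence invalid, range.
question_error = try_compile(r"/[x{1F601}-x{1F64F}]/u")
# A range between two actual code points compiles without error.
fixed_error = try_compile("[\U0001F601-\U0001F64F]")

print(question_error)
print(fixed_error)
```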

Bryan Oakley
3

Here's a Python 3 script that uses the emoji library's get_emoji_regexp() - as suggested by kingmakerking and Martijn Pieters in their answer/comment.

It reads text from a file and writes the emoji-free text to another file.

import emoji
import re


def strip_emoji(text):

    print(emoji.emoji_count(text))

    new_text = re.sub(emoji.get_emoji_regexp(), r"", text)

    return new_text


with open("my_file.md", "r") as file:
    old_text = file.read()

no_emoji_text = strip_emoji(old_text)

with open("file.md", "w+") as new_file:
    new_file.write(no_emoji_text)
jeffhale
2

Converting the string into another character set like this might help:

text.encode('latin-1', 'ignore').decode('latin-1')

Kind regards.

Tobias Ernst
2

I know this may not be directly related to the question asked, but it is helpful in solving the parent problem, which is removing emojis from text. There is a Python module named demoji that does this task very accurately and removes almost all types of emojis. It is also updated regularly to provide up-to-date emoji removal support. To remove emojis, use demoji.replace(text, '').

Johannes Pertl
Shaani
1

Tried all the answers; unfortunately, they didn't remove the new hugging face emoji, the clinking glasses emoji, and a lot more.

Ended up with a list of all possible emoji, taken from the python emoji package on github, and I had to create a gist because there's a 30k character limit on stackoverflow answers and it's over 70k characters.

Computer's Guy
  • When i tried your list i got this error `TypeError: compile() got multiple values for argument 'flags'` on python3 – Sohaib Farooqi Jun 22 '18 at 07:38
  • @bro-grammer just remove the extra "," and it will work. – Leonardo Neves Aug 21 '18 at 17:33
  • try this ```result = re.sub('[(\U0001F600-\U0001F92F|\U0001F300-\U0001F5FF|\U0001F680-\U0001F6FF|\U0001F190-\U0001F1FF|\U00002702-\U000027B0|\U0001F926-\U0001FA9F|\u200d|\u2640-\u2642|\u2600-\u2B55|\u23cf|\u23e9|\u231a|\ufe0f)]+','', text_with_emojis)``` This removes almost all the emojis – Nimin Unnikrishnan Apr 24 '20 at 10:57
1

I simply removed all the special characters using regex and this worked for me.

sent_0 = re.sub('[^A-Za-z0-9]+', ' ', sent_0)
1

If you are asking for:

  1. python2.7
  2. Chinese, EN letters, numbers
def filter_str(desstr):
    # Keep only Chinese characters, English letters, and digits; drop everything else
    return ''.join(re.findall(u'[\u4e00-\u9fa5a-zA-Z0-9]', desstr))
Hi computer
0

For me the following worked in python 3.8 for substituting emojis:

import re
# Inside [...] the characters (, | and ) are literal, so they are left out of the class here
result = re.sub('[\U0001F600-\U0001F92F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F190-\U0001F1FF\U00002702-\U000027B0\U0001F926-\U0001FA9F\u200d\u2640-\u2642\u2600-\u2B55\u23cf\u23e9\u231a\ufe0f]+', '', 'A quick brown fox jumps over the lazy dog')

It's a much simplified version of the answers given here. I tested this code for i18n support with English, Russian, Chinese and Japanese text; only emojis were removed.

This is not an exhaustive list and may have missed some emojis, but it works for most of the common emojis.
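
One caveat about patterns written in the style `[(A|B|C)]+`, as some snippets on this page are: inside square brackets, `(`, `|` and `)` are ordinary members of the set, not grouping or alternation. The miniature below is a hypothetical stand-in that uses letter ranges instead of emoji ranges so the effect is visible in plain ASCII.

```python
import re

# Hypothetical miniature of the pattern style, with letters instead of emoji.
with_extras = re.compile("[(a-c|x-z)]+")   # also matches '(', '|' and ')'
without_extras = re.compile("[a-cx-z]+")   # matches only the two letter ranges

sample = "f(b) | g(y)"
stripped_with_extras = with_extras.sub("", sample)
stripped_without_extras = without_extras.sub("", sample)

print(repr(stripped_with_extras))
print(repr(stripped_without_extras))
```

With the extra characters in the class, the parentheses and the pipe disappear from the sample along with the letters; listing the ranges back to back avoids that.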

0

For those still using Python 2.7, this regex might help:

(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])

So to use it in your code, it will somewhat look like this:

emoji_pattern = re.compile(
    u"(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])"
    "+", flags=re.UNICODE)

Why is this still needed when we actually don't use Python 2.7 that much anymore these days? Some systems/Python implementations still use Python 2.7, like Python UDFs in Amazon Redshift.

0

I also wanted to remove emojis from a text file. Most of the solutions give Unicode ranges to remove emojis, which is not a very robust way to do it. The remove_emoji method is a built-in method provided by the clean-text library in Python. We can use it to clean data that has emojis in it. We need to install the library with pip in order to use it in our programs:

pip install clean-text

We can use the following syntax to use it:

#import clean function
from cleantext import clean

#provide string with emojis
text = "Hello world!"

#print text after removing the emojis from it
print(clean(text, no_emoji=True))

Output:

Hello world!
0

Use clean-text library:

  1. pip install clean-text
  2. text = clean(text, no_emoji=True)

Quick test:

text = "This sample text contains laughing emojis         "
text = clean(text, no_emoji=True)
print(text)

This library also has some other great methods to process text.

Source: https://www.educative.io/answers/how-to-remove-emoji-from-the-text-in-python.

Alvaro Rodriguez Scelza
0

This code works for me, but first we need to install the emoji package:

pip install emoji==2.7.0

Code

import emoji


def delete_emojis(text):
    return emoji.replace_emoji(text)


import pytest

@pytest.mark.parametrize(
    "text, expected",
    [
        # 
        ("Hello, World!", "Hello, World!"),
        ("Hello, World!", "Hello, World!"),
        ("Hello, World!", "Hello, World!"),
        ("Hello, World!)", "Hello, World!)"),
        ("Hello, World!", "Hello, World!"),
    ],
)
def test_delete_emojis(text, expected):
    assert delete_emojis(text) == expected
  • `emoji.replace_emoji()` may be too aggressive. It will remove emoji even when rendered as text (using VS-15). It also removes any characters that default to text presentation but could be rendered as emoji. For instance copyright sign, etc, are usually rendered as text, but could be rendered as emoji using VS-16, but this code will strip out copyright, trademark, registered trademark signs when rendered as text. Many other text characters may be stripped as well. – Andj Aug 02 '23 at 05:59
-1

This does more than just filter out emojis. It removes non-ASCII characters, but tries to do so gently, replacing them with relevant ASCII characters where the NFKD decomposition provides one (for example accented letters and ligatures). Be aware that typographic apostrophes and quotation marks (usually coming from Apple handhelds) have no ASCII decomposition under NFKD, so they are removed rather than converted to the regular ASCII apostrophe and quotation mark.

unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")

This is robust, I use it with some more guards:

import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts when NFKD yields one (for example accented letters
        become their base letters and ligatures their component letters) or
        taken out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value

    if isinstance(value, str):
        return value

    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")

This is python 2.
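
For Python 3, a rough equivalent might look like the sketch below; it is an adaptation, not the author's code (there is no `basestring` on Python 3, and the result is decoded back to `str`). The samples show what NFKD does and does not convert.

```python
import unicodedata


def neutralize_unicode(value):
    """Python 3 sketch: NFKD-decompose, then drop whatever is still non-ASCII."""
    if not isinstance(value, str):
        return value
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for sample in ["caf\u00e9", "\ufb01ne", "it\u2019s", "dog \U0001F602"]:
    print(ascii(sample), "->", repr(neutralize_unicode(sample)))
```

The accented letter and the ligature get ASCII replacements, while the curly apostrophe and the emoji are simply dropped.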

Csaba Toth