8

I'm trying to remove just Emoji from Unicode text. I tried the various methods described in another Stack Overflow post but none of those are removing all emojis / smileys completely. For example:

Solution 1:

def remove_emoji(self, string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

This leaves an emoji behind in the following example:

Input: తెలంగాణ రియల్ ఎస్టేట్ 
Output: తెలంగాణ రియల్ ఎస్టేట్ 

Another attempt, solution 2:

def deEmojify(self, inputString):
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            returnString += ''
    return returnString

Results in removing any non-English character:

 Input: Testరియల్ ఎస్టేట్ A.P&T.S. 
 Output: Test  A.P&T.S. 

It removes all of the emoji, but it also removes the non-English characters because of character.encode("ascii"): my non-English input cannot be encoded to ASCII.

Is there any way to properly remove Emoji from international Unicode text?

Martijn Pieters
iamabhaykmr
  • New emoji are added to the Unicode standard regularly; you'll need to keep updating those regexes. – Martijn Pieters Aug 10 '18 at 10:57
  • 1
    The emoji that is left in your first example is U+1F91D, [added in Unicode 9.0](https://emojipedia.org/unicode-9.0/). And [Unicode 10.0](https://emojipedia.org/unicode-10.0/) and [Unicode 11.0](https://emojipedia.org/unicode-11.0/) have expanded the list again. I'm sure version 12.0 will require more updates. – Martijn Pieters Aug 10 '18 at 11:05

3 Answers

29

The regex is outdated. It appears to cover Emoji defined up to Unicode 8.0 (U+1F91D HANDSHAKE was added in Unicode 9.0). The other approach is just a very inefficient way of force-encoding to ASCII, which is rarely what you want when just removing Emoji (and which can be achieved much more easily and efficiently with text.encode('ascii', 'ignore').decode('ascii')).
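As a rough illustration of both points (a sketch assuming Python 3, where string literals are Unicode; the helper name strip_non_ascii is made up for this example), the pattern from the question can be checked against U+1F91D, and the ASCII one-liner can be seen to drop every non-ASCII character rather than just Emoji:

```python
import re

# The character class from the question, rewritten for Python 3.
old_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)


def strip_non_ascii(text):
    # Hypothetical helper: drops every character that is not ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


handshake = "\U0001F91D"  # HANDSHAKE, added in Unicode 9.0
sample = "తెలంగాణ " + handshake

# U+1F91D lies outside every range above, so the text comes back unchanged.
print(old_pattern.sub("", sample) == sample)

# The ASCII round-trip removes the Emoji and the Telugu text alike.
print(repr(strip_non_ascii("Test " + sample)))
```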

If you need a more up-to-date regex, take one from a package that is actively trying to keep up-to-date on Emoji; it specifically supports generating such a regex:

import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)

The package is currently up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade along when there is a new release.

Demo using your sample inputs:

>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ '))
తెలంగాణ రియల్ ఎస్టేట్ 
>>> print(remove_emoji(u'Testరియల్ ఎస్టేట్ A.P&T.S. '))
Testరియల్ ఎస్టేట్ A.P&T.S. 

Note that the regex works on Unicode text: in Python 2, make sure you have decoded from str to unicode first; in Python 3, from bytes to str.
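For Python 3, a minimal sketch of that decoding step. The byte string and the one-codepoint pattern are stand-ins for real input and for a full Emoji regex, and UTF-8 is an assumed encoding:

```python
import re

# Stand-in for a full Emoji regex: matches only U+1F91D HANDSHAKE.
pattern = re.compile("\U0001F91D")

# Bytes as they might come from a file opened in binary mode.
raw = "Test \U0001F91D".encode("utf-8")

try:
    pattern.sub("", raw)  # a str pattern cannot be applied to bytes
except TypeError as exc:
    print("TypeError:", exc)

text = raw.decode("utf-8")  # bytes -> str
cleaned = pattern.sub("", text)
print(repr(cleaned))
```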

Emoji are complex beasts these days. The above will remove complete, valid Emoji. If you have 'incomplete' Emoji components, such as skin-tone codepoints (meant to be combined with specific Emoji only), then you'll have much more trouble removing those. The skin-tone codepoints are easy (just remove those 5 codepoints afterwards). However, there is a whole host of combinations made up of innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN, together with variation selectors and the U+200D ZERO WIDTH JOINER. These have specific meanings in other contexts too, so you can't just regex them out unless you don't mind breaking text that uses Devanagari, Kannada or CJK ideographs, to name just a few examples.
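A minimal sketch of that skin-tone clean-up, assuming Python 3 (the names SKIN_TONES and remove_skin_tones are illustrative, not part of any package). It deliberately leaves U+200D alone, since that codepoint also occurs in ordinary text:

```python
import re

# U+1F3FB..U+1F3FF are the five skin-tone modifier codepoints.
SKIN_TONES = re.compile("[\U0001F3FB-\U0001F3FF]")


def remove_skin_tones(text):
    return SKIN_TONES.sub("", text)


# An unpaired modifier is removed.
print(repr(remove_skin_tones("Agents \U0001F3FC")))

# Devanagari text containing a ZERO WIDTH JOINER is left untouched.
devanagari = "\u0915\u094D\u200D\u0937"
print(remove_skin_tones(devanagari) == devanagari)
```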

That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering the Emoji_Component Emoji-data designation):

20E3          ; combining enclosing keycap
FE0F          ; VARIATION SELECTOR-16
1F1E6..1F1FF  ; regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; light skin tone..dark skin tone
1F9B0..1F9B3  ; red-haired..white-haired
E0020..E007F  ; tag space..cancel tag

which can be removed by creating a new regex to match those:

import re
try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),             # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)

then update the above remove_emoji() function to use it:

def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
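
On Python 3 only, where string literals cover the full Unicode range, the same set of codepoints can probably be written as a single character class instead. This is a sketch under that assumption, not a replacement for the version-agnostic code above; EMOJI_COMPONENTS and remove_emoji_components are names invented for the example:

```python
import re

# Unicode 11.0 Emoji components deemed safe to remove (Python 3 only).
EMOJI_COMPONENTS = re.compile(
    "["
    "\u20E3"                 # combining enclosing keycap
    "\uFE0F"                 # VARIATION SELECTOR-16
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbol letters a..z
    "\U0001F3FB-\U0001F3FF"  # light skin tone..dark skin tone
    "\U0001F9B0-\U0001F9B3"  # red-haired..white-haired
    "\U000E0020-\U000E007F"  # tag space..cancel tag
    "]"
)


def remove_emoji_components(text):
    return EMOJI_COMPONENTS.sub("", text)


# The unpaired U+1F3FC modifier is stripped.
print(repr(remove_emoji_components("HMDA plot sales Agents \U0001F3FC")))

# U+2640 FEMALE SIGN stays; only the trailing variation selector goes.
print(remove_emoji_components("\u2640\uFE0F") == "\u2640")
```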
Martijn Pieters
  • Thank you for the answer. For the input `HMDA plot sales Agents` it gives `HMDA plot sales Agents `. Still not covering all the emoji, I guess. – iamabhaykmr Aug 10 '18 at 11:25
  • 1
    @ascii_walker: that's an unpaired U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 codepoint. Whether or not that's an emoji in itself is up for debate. – Martijn Pieters Aug 10 '18 at 11:27
  • 1
    @ascii_walker: clearly, the emoji package doesn't see it as an emoji; if you paired it up with a supporting emoji, it would be removed. `🤟🏼`, for example, is removed as that's a pairing of U+1F91F U+1F3FC, which is how the pattern is to be used. – Martijn Pieters Aug 10 '18 at 11:28
  • Also, for this input it's not removing any of the emoji (not sure if they are emoji). Input: `PlotFlatHouseSaleAdvt` Output: `PlotFlatHouseSaleAdvt` – iamabhaykmr Aug 10 '18 at 11:29
  • @ascii_walker: I get `'PlotFlatHouseSaleAdvt'` for that. Are you sure you are using my implementation? – Martijn Pieters Aug 10 '18 at 11:33
  • @ascii_walker: for the full skin-tone modifier list, see https://www.unicode.org/emoji/charts/full-emoji-modifiers.html; you'd have to separately create a regex for the [5 tone component codepoints](https://www.unicode.org/emoji/charts/full-emoji-modifiers.html#component). – Martijn Pieters Aug 10 '18 at 11:41
  • It gives `'PlotFlatHouseSaleAdvt'` after decoding the text as `remove_emoji1("PlotFlatHouseSaleAdvt".decode("utf-8"))` – iamabhaykmr Aug 10 '18 at 11:44
  • FYI: I am using python 2.7 and reading those text from a csv file – iamabhaykmr Aug 10 '18 at 11:44
  • 2
    @ascii_walker: right, I was assuming Python 3 (Python 2.7 is very close to End Of Life, you should really consider upgrading!). The regex is aimed at Unicode text, handling Emoji in regexes as UTF-8 sequences opens up another huge can of worms. I'm not going to go there today. – Martijn Pieters Aug 10 '18 at 11:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177803/discussion-between-ascii-walker-and-martijn-pieters). – iamabhaykmr Aug 10 '18 at 12:09
  • Thanks a lot, Martijn, for your effort. It's working perfectly. – iamabhaykmr Aug 11 '18 at 06:08
1

emoji.get_emoji_regexp() is outdated: it was deprecated and then removed in version 2.0 of the emoji package.

If you want to remove emoji from strings, you can use emoji.replace_emoji() as shown below.

import emoji

def remove_emoji(string):
    return emoji.replace_emoji(string, '')

Visit https://carpedm20.github.io/emoji/docs/api.html#emoji.replace_emoji

Nimda
0

If you use the regex library instead of the re library, you get access to Unicode properties. You can then change your function to:

import regex

def remove_emoji(self, string):
    # Set intersection (&&) needs the V1 flag; \P{N} (numbers) is assumed here for the digits.
    emoji_pattern = regex.compile(r"[\P{L}&&\P{N}&&\P{Z}&&\P{M}]", flags=regex.V1)
    return emoji_pattern.sub(r'', string)

This will keep all letters, digits, separators and marks (accents).

JGNI
  • 2
    That's still *not enough*. You forgot `\P{P}` for starters. And `Sm` and `Sm` should be fine too; emoji are mostly `So` and `Sk` category symbols. Except for the ones that are not; the Emoji sequences in the `emoji.UNICODE_EMOJI` mapping fall into the categories Cf, Cn, Ll, Me, Mn, Nd, Pd, Po, Sk, Sm and So, so your pattern would actually *leave some Emoji in place*. – Martijn Pieters Aug 10 '18 at 12:03
  • 2
    Note that a lot of Emoji are formed from a combination of codepoints. For example, `'\U0001f477\U0001f3ff\u200d\u2640\ufe0f'` is *one* Emoji: 👷🏿‍♀️. Your regex would leave the last codepoint in place, so `♀️`. That can get confusing. – Martijn Pieters Aug 10 '18 at 12:12
  • If I remember correctly the latest version of Unicode has an `emoji` property, but I don't know which code points it covers – JGNI Aug 10 '18 at 12:22
  • Unicode 11 has [such Emoji-related properties](http://unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files), but you risk leaving behind `Emoji_Component` codepoints still. *And* they are not formally part of the UCD. – Martijn Pieters Aug 10 '18 at 12:27
  • 1
    Also see [the full list of what codepoint has what Emoji property](https://unicode.org/Public/emoji//11.0/emoji-data.txt). Note however that the property is pretty useless. Digits have the property, as do `#` and `*`. – Martijn Pieters Aug 10 '18 at 12:30
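
The point made in the comments about multi-codepoint Emoji and general categories can be checked with the standard library. A small sketch, assuming a Python 3 build whose unicodedata tables include these codepoints:

```python
import unicodedata

# One Emoji built from five codepoints (the example from the comments).
construction_worker = "\U0001F477\U0001F3FF\u200D\u2640\uFE0F"

for ch in construction_worker:
    # Print each codepoint with its Unicode general category.
    print("U+{:04X} {}".format(ord(ch), unicodedata.category(ch)))

categories = [unicodedata.category(ch) for ch in construction_worker]
```

The sequence mixes symbol, format and mark categories, which is why a filter based on categories alone struggles to remove the Emoji as a unit.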