Remove Unicode emoji using re in Python

Question

I tried to remove emoji from the Unicode text of a tweet and print the result in Python 2.7, using:

myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+',re.UNICODE)
print myre.sub('', text)

but almost all of the characters are removed from the text. I have checked several answers to other posts; unfortunately, none of them work here. Did I do something wrong in re.compile()?

Here is an example of the output, with nearly all of the characters removed:

“   '   //./” ! # # # …

Is this Python 2? Python can be built with wide or narrow Unicode support; you probably have a UCS-2 build rather than UCS-4, and that affects what you can do with regular expressions. — Martijn Pieters, Oct 26 '14 at 00:47
I was able to reproduce your issue, and I also saw that a UCS-2 build throws an exception when trying to compile the expression anyway, so that is not the issue here. — Martijn Pieters, Oct 26 '14 at 00:54
`u'\u1f300'` should be `u'\U0001f300'`. The first is `'\u1f30'` and `'0'`. — Mark Tolonen, Oct 26 '14 at 00:56

Martijn Pieters · Accepted Answer · 2019-11-27T17:00:00.560

You are not using the correct notation for non-BMP Unicode code points; you want \U0001FFFF, with a capital U and 8 hex digits:

myre = re.compile(u'['
    u'\U0001F300-\U0001F5FF'
    u'\U0001F600-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

This can be reduced to:

myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

as your first two ranges are adjacent.

Your version was specifying (with added spaces for readability):

[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+

That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.

The largest of those ranges is 0-\u1F6F (so from the digit 0 through to Ὧ), which covers a very large swathe of the Unicode standard.
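
One way to see the difference between the two escapes is to compare the strings they produce. The following is a minimal sketch written for Python 3, where every build behaves like a wide build; on Python 2 the literals would need the u prefix, and the non-BMP character would have length 2 on a narrow build. The variable names are illustrative only.

```python
# \uXXXX consumes exactly four hex digits; a fifth digit is a separate literal character.
four_digit = '\u1F300'
assert four_digit == '\u1F30' + '0'
assert len(four_digit) == 2

# \UXXXXXXXX consumes eight hex digits and yields the single code point U+1F300.
eight_digit = '\U0001F300'
assert len(eight_digit) == 1
assert ord(eight_digit) == 0x1F300
```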

The corrected expression works, provided you use a UCS-4 wide Python executable:

>>> import re
>>> myre = re.compile(u'['
...     u'\U0001F300-\U0001F64F'
...     u'\U0001F680-\U0001F6FF'
...     u'\u2600-\u26FF\u2700-\u27BF]+', 
...     re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '

The UCS-2 equivalent is:

myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+', 
    re.UNICODE)
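
The surrogate values in the UCS-2 expression follow from the standard UTF-16 encoding of each range endpoint. As a sketch of that arithmetic (the helper name to_surrogate_pair is hypothetical and not part of the original answer):

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Endpoints of the non-BMP ranges used in the wide-build expression.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogate_pair(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 through U+1F3FF share the high surrogate \ud83c, while U+1F400 through U+1F6FF share \ud83d, which is why the narrow-build pattern splits into two alternatives.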

You can combine the two in your script with an exception handler:

try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+', 
        re.UNICODE)
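
An alternative to catching re.error, offered here only as a sketch and not as the approach the answer takes, is to inspect sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide build (and on any Python 3.3 or later). The name EMOJI_PATTERN is an assumption made for illustration.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide UCS-4 build (or Python 3.3+): match the code points directly.
    EMOJI_PATTERN = (u'['
                     u'\U0001F300-\U0001F64F'
                     u'\U0001F680-\U0001F6FF'
                     u'\u2600-\u26FF\u2700-\u27BF]+')
else:
    # Narrow UCS-2 build: match the UTF-16 surrogate pairs instead.
    EMOJI_PATTERN = (u'('
                     u'\ud83c[\udf00-\udfff]|'
                     u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                     u'[\u2600-\u26FF\u2700-\u27BF])+')

myre = re.compile(EMOJI_PATTERN, re.UNICODE)
cleaned = myre.sub(u'', u'Some example text with a sleepy face: \U0001f62a')
```

Only the branch that suits the running interpreter is ever compiled, so the narrow-build failure on the wide pattern is never triggered.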

Of course, the regex is already out of date, as it doesn't cover Emoji defined in newer Unicode releases; it appears to cover Emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0).
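
The gap is easy to confirm. The sketch below assumes a wide build or Python 3 and reuses the same ranges as the corrected expression; the variable names sleepy and handshake are illustrative.

```python
import re

# Same ranges as the corrected wide-build expression.
myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

sleepy = u'\U0001F62A'      # U+1F62A SLEEPY FACE, inside the ranges
handshake = u'\U0001F91D'   # U+1F91D HANDSHAKE, outside the ranges

assert myre.sub(u'', sleepy) == u''            # removed
assert myre.sub(u'', handshake) == handshake   # left untouched
```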

If you need a more up-to-date regex, take one from a package that actively keeps up with new Emoji releases, such as emoji, which specifically supports generating such a regex:

import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)

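The package is currently up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade when there is a new release. Note that get_emoji_regexp() was removed in version 2.0 of the package; on those releases, emoji.replace_emoji(text, replace=u'') serves the same purpose.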

Just what I was commenting above, but I get `sre_constants.error: bad character range` on Python 2 narrow build. — Mark Tolonen, Oct 26 '14 at 00:59
@MarkTolonen: yes, you can only use this on a wide build, see [Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"](http://stackoverflow.com/q/12636489) for an approach (you'll have to match the UTF-16 surrogate pairs instead). — Martijn Pieters, Oct 26 '14 at 01:04
Wow, thanks! It seems the UCS-4 build works properly! I'd better figure out more about UCS and Unicode things. One thing I'm curious about is u'[' and \u27BF]. Why is there a quote after [ but no quote after \u27BF? — Young, Oct 26 '14 at 13:07
@Young I just broke up the expression across several lines to make it readable. All you are seeing there is several unicode string literals (`u'...'`) in a row which Python [merges into one string](http://stackoverflow.com/a/26433185) for you. — Martijn Pieters, Oct 26 '14 at 15:37
@MartijnPieters I did not understand why "a capital U and 8 digits" is the correct notation for non-BMP Unicode code points. When would I use this vs. the 4-digit notation? Can you demystify this please? — Ankur Agarwal, Oct 22 '15 at 00:08
@abc: the BMP uses codepoints up to 0xFFFF. That's four digits. Anything outside of the BMP uses *more* than four hex digits, so you cannot use the `\uhhhh` 4-digit syntax for those, you need to use the `\Uhhhhhhhh` 8 digit syntax instead. — Martijn Pieters, Oct 22 '15 at 08:25
Nice! To convert the string to unicode in a function I did `lambda txt : myre.sub("", unicode(txt, "utf-8"))` and it worked with no problems. Thanks. — Marcelo Lazaroni, Nov 08 '16 at 15:09
