Filter out multiple emojis from Unicode text in Python

Question

Let's say we have the following strings containing emojis:

sent1 = '  right'
sent2 = 'Some text?! '
sent3 = ''

The task is to remove the text and get the following output:

sent1_emojis = '  '
sent2_emojis = ' '
sent3_emojis = ''

Based on a past question (Regex Emoji Unicode), I use the following regex to identify strings that contain at least one emoji:

import re

emoji_pattern = re.compile(u".*(["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"])+", flags=re.UNICODE)

In order to get the output string I use the following:

re.match(emoji_pattern, sent1).group(0)

and so on.

There's a problem with the sent2 string: re.match(emoji_pattern, sent2).group(0) returns the whole of sent2 instead of the emojis only.
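
A minimal reproduction of this behavior, as a sketch: the sample strings below are hypothetical stand-ins written with \U escapes, not the original sent1 and sent2. The greedy .* consumes everything up to the last emoji, so group(0) spans from the start of the string through that emoji.

```python
import re

# Same pattern as above: greedy ".*" followed by one or more emojis.
emoji_pattern = re.compile(
    ".*(["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "])+"
)

# Hypothetical sample strings (assumptions, not the original data).
emojis_first = "\U0001F600 \U0001F680 right"
text_first = "Some text?! \U0001F525"

# Emojis at the start: the match ends at the last emoji, so it looks correct.
match_a = emoji_pattern.match(emojis_first).group(0)
# Text at the start: ".*" swallows the text, so the whole string is returned.
match_b = emoji_pattern.match(text_first).group(0)

print(match_a == "\U0001F600 \U0001F680")
print(match_b == text_first)
```

This would explain why a string that begins with emojis appears to work while one that begins with text does not.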

score 3 · Accepted Answer · answered Apr 22 '19 at 09:50

A small change to emoji_pattern will do the job:

emoji_pattern = re.compile(u"(["   # .* removed
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"])", flags=re.UNICODE)       # + removed

for sent in [sent1, sent2, sent3]:
    print(''.join(re.findall(emoji_pattern, sent)))
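
For reference, a self-contained sketch of the same findall approach. The helper name extract_emojis and the sample strings are hypothetical (written with \U escapes), and the character class only covers the four ranges listed in the answer, so emojis outside those ranges would be missed.

```python
import re

# Character class covering the four ranges from the answer above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]"
)


def extract_emojis(text):
    """Return only the characters of `text` that fall in the emoji ranges."""
    # findall yields one string per matched character; join them back together.
    return "".join(emoji_pattern.findall(text))


# Hypothetical sample strings (assumptions, not the original data).
samples = [
    "\U0001F600 \U0001F680 right",
    "Some text?! \U0001F525",
    "\U0001F44D",
]

for sample in samples:
    print(extract_emojis(sample))
```

Note that this drops the whitespace between emojis as well; if the spaces should be kept, the class would need to include them.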

Michał Turczyn · Answer 2 · 2019-04-22T10:04:39.610

If you need to remove text, you can do it without worrying about emojis: just use a pattern that matches the characters you want to drop, such as \w, which matches any word character (equivalent to [a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches letters and digits from other scripts). If you need to match more, e.g. whitespace, use [\w\s]. If you need dots, commas, etc., use [\w\s.,-]. Then replace every match with an empty string.

This way you'll remove anything except emojis.
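
A minimal sketch of this substitution approach, assuming Python 3 str patterns. The helper name strip_text and the sample string are hypothetical, and the punctuation listed in the class is just an illustrative choice; any character not covered by the class survives along with the emojis.

```python
import re

# Word characters, whitespace, and a few punctuation marks to strip.
# The punctuation set here is an assumption; extend it as needed.
text_pattern = re.compile(r"[\w\s.,?!-]")


def strip_text(text):
    """Remove every character matched by text_pattern, keeping the rest."""
    return text_pattern.sub("", text)


# Hypothetical sample string (an assumption, not the original data).
sample = "Some text?! \U0001F525"
print(strip_text(sample))
```

The design trade-off is that this is a deny-list: it avoids enumerating emoji ranges, but anything unexpected in the input (for example a symbol not listed in the class) is kept too.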

EDIT: I got an interesting result in the Python regex engine: Demo

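I used [\u0000-\uFFFF], expecting it to match any character. Surprisingly, it doesn't match emojis, while . (dot, meaning any character except a newline) does. The reason is that [\u0000-\uFFFF] covers only the Basic Multilingual Plane, and the emojis in the ranges above have code points beyond U+FFFF.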
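
The observation can be checked with a short sketch, assuming Python 3, where a str is a sequence of code points. The variable names are hypothetical; U+1F600 is used as a sample emoji.

```python
import re

# Range limited to code points U+0000..U+FFFF.
bmp_only = re.compile(r"[\u0000-\uFFFF]")
# Range extended to the highest Unicode code point, U+10FFFF.
all_planes = re.compile(r"[\U00000000-\U0010FFFF]")

emoji = "\U0001F600"  # sample emoji, code point above U+FFFF

bmp_hit = bmp_only.search(emoji) is not None
all_hit = all_planes.search(emoji) is not None
dot_hit = re.search(".", emoji) is not None

print(hex(ord(emoji)), bmp_hit, all_hit, dot_hit)
```

Extending the upper bound of the range to \U0010FFFF is one way to make a range-based class cover such characters.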
