
Let's say we have the following strings containing emojis:

sent1 = '  right'
sent2 = 'Some text?! '
sent3 = ''

The task is to remove text and get the following output:

sent1_emojis = '  '
sent2_emojis = ' '
sent3_emojis = '' 

Based on a past question (Regex Emoji Unicode), I use the following regex to identify strings that contain at least one emoji:

import re

emoji_pattern = re.compile(u".*(["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "])+", flags=re.UNICODE)

In order to get the output string I use the following:

re.match(emoji_pattern, sent1).group(0)

and so on.

There's a problem with the sent2 string: re.match(emoji_pattern, sent2).group(0) returns the whole of sent2 instead of only the emojis.

Michał Turczyn
balkon16

2 Answers


A small change to emoji_pattern will do the job:

emoji_pattern = re.compile(u"(["                     # .* removed
u"\U0001F600-\U0001F64F"  # emoticons
u"\U0001F300-\U0001F5FF"  # symbols & pictographs
u"\U0001F680-\U0001F6FF"  # transport & map symbols
u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                "])", flags= re.UNICODE)             # + removed

for sent in [sent1, sent2, sent3]:
    print(''.join(re.findall(emoji_pattern, sent)))
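To see why dropping the leading .* matters, here is a minimal sketch that compares the two patterns on one input. The sample string is an assumption (the original emojis are not reproduced here), and the names emoji_chars, greedy_pattern and single_pattern are made up for illustration.

```python
import re

# Same character ranges as in the patterns above.
emoji_chars = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
)

# Pattern from the question: the greedy .* swallows the text before the emoji.
greedy_pattern = re.compile(".*([" + emoji_chars + "])+")
# Pattern from this answer: one emoji per match, nothing else.
single_pattern = re.compile("[" + emoji_chars + "]")

# Hypothetical sample: text followed by one emoji (U+1F600).
sample = "Some text?! \U0001F600"

# group(0) is the whole match, so it includes everything .* consumed.
greedy_result = greedy_pattern.match(sample).group(0)

# findall returns each emoji separately; joining them drops the text.
found = "".join(single_pattern.findall(sample))

print(greedy_result == sample)   # the question's pattern returns the full string
print(found == "\U0001F600")     # the modified pattern returns only the emoji
```

Note that joining the matches also drops any whitespace that stood between emojis in the input.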




Chris

If you need to remove text, you can do it without worrying about emojis: use a pattern that matches the characters you want to drop, such as \w, which matches any word character ([a-zA-Z0-9_], plus other Unicode letters and digits when matching str patterns in Python 3). If you need to match more, e.g. whitespace, use [\w\s]. If you need dots, commas, etc., use [\w\s.,-]. Then replace every match with an empty string.

This way you'll remove anything except emojis.
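A minimal sketch of this removal approach, assuming Python 3 and a punctuation set chosen only for illustration (text_pattern and keep_emojis are hypothetical names; the sample string is made up):

```python
import re

# Characters to strip: word characters, whitespace, and some punctuation.
# The punctuation listed here is an assumption; extend it to fit your text.
text_pattern = re.compile(r"[\w\s.,?!-]")

def keep_emojis(sentence):
    """Delete every character matched by text_pattern and return the rest."""
    return text_pattern.sub("", sentence)

# Hypothetical sample: text followed by two emojis.
sample = "Some text?! \U0001F600\U0001F680"
print(keep_emojis(sample))
```

Because \s is part of the class, whitespace between emojis is removed as well; leave \s out if those spaces should be kept. Any character not covered by the class (other punctuation, for instance) survives along with the emojis.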

EDIT: I got an interesting result with the Python regex engine: Demo

I used [\u0000-\uFFFF], which I expected to match any character. Surprisingly, it doesn't match emojis, while . (dot, meaning any character except a newline by default) does.
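A likely explanation is that [\u0000-\uFFFF] covers only the Basic Multilingual Plane, while the emoji ranges used in the question start at U+1F1E0, above U+FFFF. The sketch below checks this in Python 3; the sample emoji and the names bmp_only and beyond_bmp are assumptions for illustration.

```python
import re

emoji = "\U0001F600"  # code point above U+FFFF

# Matches only code points U+0000 through U+FFFF.
bmp_only = re.compile(r"[\u0000-\uFFFF]")
# Matches only code points above U+FFFF (supplementary planes).
beyond_bmp = re.compile(r"[\U00010000-\U0010FFFF]")

print(hex(ord(emoji)))           # 0x1f600
print(bmp_only.search(emoji))    # None: the emoji is outside the range
print(re.search(".", emoji))     # a match object: dot is not limited to U+FFFF

# Keeping only supplementary-plane characters extracts the emoji here.
result = "".join(beyond_bmp.findall("Some text?! " + emoji))
print(result == emoji)
```

The supplementary-plane class is broader than "emoji": it also matches other characters above U+FFFF, and it misses symbols that live inside the BMP, so treat it as a rough filter only.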

Michał Turczyn