I have a dataset of tweets, each containing at least one emoji and sometimes several. The emojis can appear at the start, in the middle, or at the end of a sentence, so every tweet is different. I am having difficulty separating the individual emojis from the rest of the sentence: if I loop over the words, a run of consecutive emojis is treated as a single word.

She is too hot for Congress.  Vote her out!  #sarcasm 

Expected output: She is too hot for Congress. Vote her out! #sarcasm

The Struggle is Real  #struggle #struggleisreal #struggles #funny #humor #saying #sarcasm #lifestruggles #sarcastic #funnysaying #sayings #thestruggleisreal 

Expected output: The Struggle is Real #struggle #struggleisreal #struggles #funny #humor #saying #sarcasm #lifestruggles #sarcastic #funnysaying #sayings #thestruggleisreal

  For More Funny Post Follow

Expected output: For More Funny Post Follow

Related post: Counter for words and emoji

The answer in that post gives me a list of tokenized words for each tweet in the dataset, which I don't want, and it does not solve my problem either: I still do not get a space between the emojis.

jonrsharpe
CD_NS

1 Answer

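With version 1.5.0 of the emoji library, this is straightforward. (In emoji 2.0 and later, UNICODE_EMOJI is no longer available; emoji.EMOJI_DATA or emoji.is_emoji() serve the same purpose.)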

import emoji

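def extract_emojis(s):
    # Pad every single-code-point emoji with spaces; other characters pass through unchanged.
    # Zero-width joiners (\u200d) are not emoji themselves, so joined sequences get split apart.
    return ''.join((' '+c+' ') if c in emoji.UNICODE_EMOJI['en'] else c for c in s)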

Test:

s = " me así, seds  hello ‍ emoji hello ‍‍ how are  you today"

extract_emojis(s)

Output:

'     me así, se        ds        hello     \u200d   emoji hello   \u200d  \u200d   how are    you today        '
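
The output above leaves stray \u200d characters because each code point is checked on its own. As a rough sketch of one way around that without a third-party library, the function below pads whole emoji sequences (including ZWJ-joined ones and flag pairs) and then collapses repeated whitespace. The name space_out_emojis and the code point ranges are assumptions made for illustration; the ranges only approximate the Unicode emoji set and will miss some symbols.

```python
import re

# Approximate emoji code point ranges (an assumption, not the full Unicode emoji set).
_BASE = (
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
)
_MOD = "\uFE0F\U0001F3FB-\U0001F3FF"   # variation selector-16 and skin tone modifiers
_FLAG = "\U0001F1E6-\U0001F1FF"        # regional indicators, used in pairs for flags

# One emoji is either a flag pair, or a base character with optional modifiers
# followed by any number of zero-width-joiner (U+200D) continuations.
_EMOJI_RE = re.compile(
    f"(?:[{_FLAG}]{{2}}|[{_BASE}][{_MOD}]*(?:\u200d[{_BASE}][{_MOD}]*)*)"
)


def space_out_emojis(text):
    """Surround each emoji sequence with spaces, then collapse extra whitespace."""
    padded = _EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)
    # split() with no arguments drops leading/trailing whitespace and merges runs,
    # so newlines inside the tweet are flattened to single spaces as well.
    return " ".join(padded.split())


if __name__ == "__main__":
    # Two "face with tears of joy" emojis glued to the preceding word.
    print(space_out_emojis("so funny\U0001F602\U0001F602  #sarcasm"))
```

With this approach each emoji ends up as its own whitespace-separated token, so a plain split() on the result yields words and emojis separately. If the emoji package is available, its own emoji data is a more complete source than these hand-written ranges.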
meti