Extract Unicode-Emoticons in list, Python 3.x

Question

I am working with Twitter data and want to extract the emoticons from a list. The data is UTF-8 encoded. I read the file line by line; here are three example lines:

['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']
['This', 'is', 'another', 'tweet', 'with', 'a', 'emoticon', '']
['This', 'tweet', 'contains', 'no', 'emoticon']

I'd like to collect the emoticons for each line, like this:

['', '⚓️']

and so on.

I did some research and found that there is an 'emoji' package for Python. I tried to use it in my code like this:

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []
        data = re.findall(r'\X', elements)
        for word in data:
            if any(char in emoji.UNICODE_EMOJI for char in word):
                emoji_list.append(word)

First try

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []

        for c in elements:
            if c in emoji.UNICODE_EMOJI:
                emojilist.append(c)

Second try

I tried the examples given in "How to extract all the emojis from text?", but they didn't work for me and I'm not sure what I did wrong.

I'd really appreciate any help to extract the emoticons, thanks in advance! :)

Your indentation is wrong; after the `for line in f:` you need to indent the rest. — L3viathan, Jun 16 '18 at 13:51
I want it to create a separate list of emoticons for each line, not one list for the whole dataset. So I also need to handle the lines that contain no emoticons. — Anastasia, Jun 16 '18 at 18:17

score 2 · Accepted Answer · answered Jun 16 '18 at 13:50

Emojis occupy several Unicode ranges, which this regex pattern approximates:

>>> import re
>>> emoji = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')

You can use that to filter your lists:

>>> list(filter(emoji.match, ['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']))
['', '⚓️']

N.B.: The pattern is an approximation and may capture some additional characters.
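
To get one list per line, including an empty list for lines without emoticons (as asked in the comments on the question), the same pattern can be applied line by line. This is only a minimal sketch: the function name `emojis_per_line` is made up for illustration, and it assumes the tokens in each line are tab-separated, as the `split('\t')` in the question suggests.

```python
import re

# Same approximate ranges as in the pattern above; it may also match
# some symbols that are not emojis.
EMOJI_PATTERN = re.compile('[\u203C-\u3299\U0001F000-\U0001F644]')

def emojis_per_line(lines):
    """Return one list of emoji-like tokens per input line (empty if none).

    Hypothetical helper; assumes tab-separated tokens on each line.
    """
    result = []
    for line in lines:
        tokens = line.rstrip('\n').split('\t')
        # match() anchors at the start of the token, so a token counts
        # when its first character falls inside one of the ranges.
        result.append([tok for tok in tokens if EMOJI_PATTERN.match(tok)])
    return result

sample = [
    "This\tis\ta\ttest\ttweet\t\u2693\ufe0f\n",
    "This\ttweet\tcontains\tno\temoticon\n",
]
print(emojis_per_line(sample))
```

With a real file, the open file object can be passed directly, e.g. `emojis_per_line(f)` inside the `with open(...)` block, since iterating over it yields one line at a time.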

answered Jun 16 '18 at 13:50

L3viathan


Thank you very, very much! :) – Anastasia Jun 16 '18 at 16:44
