Remove unicode encoded emojis from Twitter tweet

Question

For a data science project I am tasked with cleaning up our Twitter data. The tweets contain unicode encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using the Python package "re", and so far I have been successful in removing "simple" Unicode characters like \u201c (double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I am trying to remove more complex structures, like for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove french flag emoji

nothing happens, no matter whether I prefix the string with a 'u', an 'r' or nothing at all. The unicode remains in the string.

EDIT: Thanks to @Shawn Shroyer's answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).

The second problem is that I don't want to specify every single emoji individually. I would like to remove them all in a much simpler fashion, but without removing ALL unicode characters, because I need to retain stuff like \u2019 (single quotation mark).

Does this answer your question? [removing emojis from a string in Python](https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python) — cullzie, Jan 05 '21 at 16:25
[try this solution](https://stackoverflow.com/a/51785357/11794224) that uses the emoji library — Zeph, Jan 05 '21 at 17:12

score 1 · Accepted Answer · edited Jan 06 '21 at 06:24

My suggestion would be to create a list of the values you would like to replace. You need to escape each \ by adding another backslash, or prefix the string with 'r' (a raw string) so the backslashes do not need to be escaped. The 'ur' prefix exists only in Python 2 and is a syntax error in Python 3.

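import re

# Double each backslash, as in the working edit in the question.
to_remove_arr = ['\\ud83d\\udcf8', '\\ud83c\\uddeb\\ud83c\\uddf7']
pattern_str = "|".join(to_remove_arr)  # one alternation matching any listed sequence
text = re.sub(pattern_str, "", text)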
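
As an aside to the list-based approach, here is a minimal sketch for the case where the tweets contain the escape sequences as literal text (a backslash, a "u" and four hex digits), which is an assumption about the data and not something the question confirms. Instead of listing every sequence, it matches any escaped UTF-16 surrogate pair, so BMP escapes such as \u2019 are left alone. The names `ESCAPED_SURROGATE_PAIR` and `strip_escaped_pairs` are made up for illustration. Note that this removes every character outside the Basic Multilingual Plane, not only emojis.

```python
import re

# One escaped surrogate pair written out as literal text, e.g. \ud83d\udcf8:
# a high surrogate (d800-dbff) followed by a low surrogate (dc00-dfff).
ESCAPED_SURROGATE_PAIR = re.compile(
    r"\\ud[89ab][0-9a-f]{2}\\ud[c-f][0-9a-f]{2}", re.IGNORECASE
)

def strip_escaped_pairs(text):
    """Remove literal \\uXXXX\\uXXXX surrogate-pair escapes from text."""
    return ESCAPED_SURROGATE_PAIR.sub("", text)

# Raw string: the backslashes below are real characters in the sample.
sample = r"Nice shot \ud83d\udcf8 from \ud83c\uddeb\ud83c\uddf7 it\u2019s great"
print(strip_escaped_pairs(sample))
```

If the tweets hold real emoji characters rather than literal escapes, this pattern will not match anything.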

Edit: the list-based solution above removes only the specific Unicode characters you list. To remove all non-ASCII Unicode characters instead:

text = text.encode("ascii", "ignore").decode()
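
A small illustration of what the ASCII round trip does, assuming the text holds real Unicode characters rather than literal escape text. The sample string is made up. It shows that the apostrophe \u2019, which the question wants to keep, is dropped along with the emoji.

```python
# Sample containing a right single quotation mark and the camera emoji.
text = "It\u2019s a photo \U0001F4F8"

# Characters that cannot be encoded as ASCII are silently dropped.
ascii_only = text.encode("ascii", "ignore").decode()

print(repr(ascii_only))  # 'Its a photo '
```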

Edit: to remove only emojis I found:

import re

# Compiled once; covers U+2600-27BF, U+1F300-1F64F and U+1F680-1F6FF only.
RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')

def strip_emoji(text):
    return RE_EMOJI.sub('', text)
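
Since these ranges do not cover flag emojis such as the French flag from the question, here is a sketch with a wider set of ranges. The list is an assumption and is not exhaustive, so treat it as a starting point to adjust for your data. `EMOJI_PATTERN` and `strip_emoji_wide` are illustrative names. It also assumes the text holds real emoji characters, not literal escape text.

```python
import re

# Assumed, non-exhaustive set of emoji-related ranges; adjust for your data.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero width joiner
    "]+"
)

def strip_emoji_wide(text):
    """Remove characters in the ranges above; other non-ASCII text is kept."""
    return EMOJI_PATTERN.sub("", text)

print(strip_emoji_wide("It\u2019s \U0001F4F8 \U0001F1EB\U0001F1F7!"))
```

Punctuation such as \u2019 and \u201c falls outside all of these ranges, so it survives.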

edited Jan 06 '21 at 06:24 by Mark Tolonen
answered Jan 05 '21 at 16:29 by Shawn Shroyer

Edited answer to include re.sub and combining the array of unicode characters into a single regex expression – Shawn Shroyer Jan 05 '21 at 16:55
thank you, this way I at least would have a working solution! But it would be tiresome to collect every single emoji individually, so hopefully someone has an idea to remove them all together ;) – EXQuIsIIt Jan 05 '21 at 17:00
Sorry - based on the question I thought you were looking to only remove specific unicode characters. Edited answer with a solution to remove all non-ascii characters – Shawn Shroyer Jan 05 '21 at 17:09
Saw your edit and included code to strip only emojis – Shawn Shroyer Jan 05 '21 at 17:15
That last one was the solution I was looking for, thanks! Some emojis are still slipping through, but I just have to adjust the ranges or add more. Thank you very much! :) – EXQuIsIIt Jan 05 '21 at 17:39
