I have a Pandas dataframe with a 'Tweet' column. Some of its rows look like this:
Tweet
Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84
Setidaknya gw punya jari tengah buat lu, sebelom gw ukur nyali sama bacot lu \xf0\x9f\x98\x8f'
Ari sarua beki mah repeh monyet\xf0\x9f\x98\x86\xf0\x9f\x98\x86'
Cerita silat lae \xf0\x9f\x98\x80 semacam Kho Ping Hoo yang dari Indonesia, tapi Liang Ie Shen penulis dari China
As you can see, these codes are the UTF-8 byte codes of emoji. For example, the original form of the first row is "Ya bani taplak dkk 😄😄😄", where 😄 is written as \xf0\x9f\x98\x84. I've created an emoji list containing these codes, based on this site, and I want to remove the codes from the tweet data, so my desired result for the first row is "Ya bani taplak dkk ".
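To illustrate the relationship between the emoji and the code, here is a small sketch. It assumes the column holds the escape sequence as literal text (a backslash, an `x`, and two hex digits per byte) rather than the emoji character itself; I have not confirmed that this is how the file is stored.

```python
# The emoji from the first row and its UTF-8 byte sequence.
emoji = "\U0001F604"
encoded = emoji.encode("utf-8")
print(encoded)  # b'\xf0\x9f\x98\x84'

# Assumption: the 'Tweet' column stores the escape sequence as plain text,
# so each byte takes four characters (backslash, 'x', two hex digits).
literal = r"\xf0\x9f\x98\x84"
print(len(emoji), len(encoded), len(literal))  # 1 4 16
```

If that assumption holds, the entries in the emoji list have to be in the same literal form to match anything in the text.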
I tried to apply the answer to this problem to my dataframe, but it does not work. At first I suspected this was because most of the byte codes are joined together without spaces, as in the first and third rows. However, the second and fourth rows were not altered either. Here is my code so far:
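import re
import pandas as pd
df = pd.read_csv(tweet_data, sep='\t')
df2 = pd.read_csv(emoji_data, sep='\t')
emoji_list = df2['Code 2'].tolist()
# Strip literal "\n" markers and the RT / USER / URL placeholders
df['Tweet'] = df['Tweet'].str.replace(r'\\n', '', regex=True).str.replace(r'RT', '').str.replace(r'USER', '').str.replace(r'URL', '')
# Build one alternation from all emoji codes and remove every match
p = re.compile('|'.join(map(re.escape, emoji_list)))
df['Tweet'] = [p.sub('', text) for text in df['Tweet']]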
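To make the goal concrete, here is a minimal sketch of the result I'm after. It assumes the tweets contain the codes as literal text, and it uses a generic pattern for any run of `\xNN` sequences instead of my emoji list, so it would also strip non-emoji byte codes. The `tweets` list and the `byte_escape` name are made up for illustration from the rows above.

```python
import re

# Sample rows copied from the data above, as raw strings so the
# backslashes stay literal characters.
tweets = [
    r"Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84",
    r"Ari sarua beki mah repeh monyet\xf0\x9f\x98\x86\xf0\x9f\x98\x86",
]

# One or more consecutive literal "\xNN" sequences (NN = two hex digits).
byte_escape = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

cleaned = [byte_escape.sub("", text) for text in tweets]
print(cleaned)  # ['Ya bani taplak dkk ', 'Ari sarua beki mah repeh monyet']
```

The same list comprehension can be run over `df['Tweet']`, as in my code above.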
Any help appreciated, thank you.