Remove \xDD substrings from Pandas Dataframe

Question

I have a pandas DataFrame with a 'Tweet' column; some of its rows look like this:

Tweet

 Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84
Setidaknya gw punya jari tengah buat lu, sebelom gw ukur nyali sama bacot lu \xf0\x9f\x98\x8f'
Ari sarua beki mah repeh monyet\xf0\x9f\x98\x86\xf0\x9f\x98\x86'
 Cerita silat lae \xf0\x9f\x98\x80 semacam Kho Ping Hoo yang dari Indonesia, tapi Liang Ie Shen penulis dari China

As you can see, these codes are the UTF-8 byte codes of emoji. For example, the first row's original form is " Ya bani taplak dkk 😄😄😄", where 😄 is denoted by \xf0\x9f\x98\x84. I've created an emoji list containing the codes based on this site, and I want to remove these codes from the tweet data, so my desired result for the first row is " Ya bani taplak dkk ".

I tried to apply the answer to this problem to the dataframe, but it does not work. At first I suspected this was because most of the byte codes are joined without spaces, as in the first and third rows. However, the second and fourth rows were not altered either. Here is my code so far:

import re
import pandas as pd

df = pd.read_csv(tweet_data, sep='\t')
df2 = pd.read_csv(emoji_data, sep='\t')

emoji_list = df2['Code 2'].tolist()

df['Tweet'] = df['Tweet'].str.replace(r'\\n', '', regex=True).str.replace('RT', '').str.replace('USER', '').str.replace('URL', '')

p = re.compile('|'.join(map(re.escape, emoji_list)))
df['Tweet'] = [p.sub('', text) for text in df['Tweet']]

Any help appreciated, thank you.

You may find an emoji pattern [here](https://stackoverflow.com/a/56626951/3832970). Probably, `emoji` module will be of help, too. — Wiktor Stribiżew, Mar 04 '20 at 00:12
Something is just wrong here, nothing works with your input. Could you please provide a **reproducible** example? If `s = "\U0001F604 here"`, all works well. — Wiktor Stribiżew, Mar 04 '20 at 10:18
@WiktorStribiżew unfortunately, the available data provides this form of emoji — rayyar, Mar 04 '20 at 13:28
What is the data? Provide a sample to us to repro the issue. Or go with solutions like the one below — Wiktor Stribiżew, Mar 04 '20 at 13:32
@WiktorStribiżew the data is just like above, I have a dataframe which has Tweet column, contains tweet data, some of them are like four data above which has emoji byte code. As you said, it should use unicode (like this 'U0001F604') in order to be successfully processed by Python. Currently I tried to map the current emoji codes with the right one — rayyar, Mar 04 '20 at 13:54
So, to correctly repro the issue, we should define the sample string literal as `text = "Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84"`? Not as `text = b"Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84"`? Or any other way? — Wiktor Stribiżew, Mar 04 '20 at 13:56
So does the solution below help? Look, it has got 2 upvotes. — Wiktor Stribiżew, Mar 04 '20 at 15:09

score 0 · Answer 1 · answered Mar 04 '20 at 02:16

If you are handling tweets data, I have a function to clean it.

import re
from nltk.tokenize import WordPunctTokenizer

def clean_tweets(tweet):
    # Drop @mentions, then links.
    user_removed = re.sub(r'@[A-Za-z0-9]+', '', tweet)
    link_removed = re.sub(r'https?://[A-Za-z0-9./]+', '', user_removed)
    # Replace every character that is not an ASCII letter or digit with a space.
    only_alphanumeric = re.sub(r'[^a-zA-Z0-9]', ' ', link_removed)
    lower_case_tweet = only_alphanumeric.lower()
    # Tokenize and re-join to collapse runs of whitespace.
    tok = WordPunctTokenizer()
    words = tok.tokenize(lower_case_tweet)
    clean_tweet = ' '.join(words).strip()
    return clean_tweet

Then you only need to apply this function to your column that contains the tweet data.

df['Tweet'] = df['Tweet'].apply(clean_tweets)

If you only need the part that removes the emoji, it is re.sub('[^a-zA-Z0-9]', ' ', tweet): it replaces every character that is not an ASCII letter or digit with a space. Hope it helps.

This solution removes too much, e.g. Russian (`ф`) or Polish letters (like `ą`). — Wiktor Stribiżew, Mar 04 '20 at 10:20

score 0 · Answer 2 · rayyar · 2020-03-05T23:19:44.037

So, I've found the answer. It took so long because I first experimented with the solution outside the dataframe. Consider this:

import re

text = 'Ya bani taplak dkk \xf0\x9f\x98\x84'
removed = re.sub(r"\\x[A-Za-z0-9./]+", "", text)

This does not work. However, if you prefix the string with r to make it a raw string, like this:

removed = re.sub(r"\\x[A-Za-z0-9./]+", "", r'Ya bani taplak dkk \xf0\x9f\x98\x84')

it works, and removed becomes "Ya bani taplak dkk " (the emoji codes are gone; the space before them remains). Foolishly, I spent quite a long time trying to find a way to apply this approach to the dataframe, and finally I just tried this code, without high expectations, to see how it behaves:

df['Tweet'] = df['Tweet'].str.replace(r'\\x[A-Za-z0-9./]+', '', regex=True)

And it works right away... Perhaps pandas already holds the text in a form equivalent to a raw string, so you don't have to mark it with r. But that is just my tentative assumption. If anyone could give a sound explanation for this, I would appreciate it. Cheers!
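One likely explanation, sketched here under the assumption that the file stores the emoji as literal backslash-x text (as the sample rows suggest): in a normal Python string literal, `\xf0` is an escape sequence that the interpreter turns into a single character, so no backslash is left for the regex to match. A raw literal keeps the backslash, the `x` and the two hex digits as separate characters, which is also what pandas returns when it reads that literal text from a file. The variable names below are only illustrative.

```python
import re

pattern = r"\\x[A-Za-z0-9./]+"

# Normal literal: Python converts each \xNN escape into ONE character,
# so the resulting string contains no backslash at all.
cooked = 'dkk \xf0\x9f\x98\x84'

# Raw literal: backslash, 'x' and two hex digits stay as four characters.
# Assumption: this mirrors what read_csv yields for the data in the question.
raw = r'dkk \xf0\x9f\x98\x84'

print(len(cooked), '\\' in cooked)  # 8 False
print(len(raw), '\\' in raw)        # 20 True

cooked_result = re.sub(pattern, "", cooked)
raw_result = re.sub(pattern, "", raw)
print(cooked_result == cooked)  # True: nothing matched, nothing removed
print(repr(raw_result))         # 'dkk '
```

So the `r` prefix only matters for literals typed into source code; text loaded from a file is never escape-processed, which would explain why the dataframe version works without it.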

edited Mar 05 '20 at 23:19 · answered Mar 04 '20 at 15:57 · rayyar

But that has nothing to do with emojis. You are plainly removing a `\x` substring with 1 to 50 chars after it. This is rather a dangerous and fragile solution, you may remove real data. Do you just want to remove consecutive `\x\d+` patterns? Use `df['Tweet'] = df['Tweet'].str.replace(r'(?:\\x\d+)+', '')` – Wiktor Stribiżew Mar 05 '20 at 11:00
@WiktorStribiżew, yes, I already changed the regex to make it more robust... sorry for the confusing terms, in my data all emojis are coded with initial '\x'. – rayyar Mar 05 '20 at 23:22
I posted a [solution for the exact problem](https://stackoverflow.com/a/60560359/3832970) you stated in the question. Your current regex may overfire, see [this regex demo](https://regex101.com/r/d2nZOD/5). – Wiktor Stribiżew Mar 06 '20 at 08:46

score 0 · Answer 3 · answered Mar 06 '20 at 08:43

To remove one or more repetitions of a literal \x substring followed by two hex chars in Python, you may use

(?:\\x[A-Fa-f0-9]{2})+

See the regex demo.

Here are some examples:

import re
rx = r"\s*(?:\\x[A-Fa-f0-9]{2})+"
text = r"Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84"
print( re.sub(rx, '', text) )
# => Ya bani taplak dkk

The \s* matches zero or more whitespace characters, so any whitespace directly before the removed match is trimmed as well.
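To illustrate why requiring exactly two hex digits matters, here is a small comparison with the looser `\\x[A-Za-z0-9./]+` pattern from the earlier answer. The sample string is hypothetical and chosen only to show the difference; it is not taken from the question's data.

```python
import re

loose = re.compile(r"\\x[A-Za-z0-9./]+")        # pattern from the earlier answer
strict = re.compile(r"(?:\\x[A-Fa-f0-9]{2})+")  # exactly two hex digits per \x

# Hypothetical tweet: a Windows-style path plus one literal emoji byte code.
sample = r"see C:\xray.png \xf0\x9f\x98\x84"

loose_result = loose.sub("", sample)
strict_result = strict.sub("", sample)

print(repr(loose_result))   # 'see C: '  (the file name is removed too)
print(repr(strict_result))  # 'see C:\\xray.png '  (only the byte codes are removed)
```

The strict pattern can still hit real text that happens to look like `\x` plus two hex digits, so it is a narrower filter rather than a guarantee.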

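In pandas, use Series.str.replace (pass regex=True explicitly, because since pandas 2.0 the pattern is treated as a literal string by default):

df['Tweet'] = df['Tweet'].str.replace(r"\s*(?:\\x[A-Fa-f0-9]{2})+", "", regex=True)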
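A minimal end-to-end sketch of that call follows. The two-row frame is made up for illustration, and the raw literals stand in for text that would normally come from `read_csv`.

```python
import pandas as pd

# Hypothetical two-row frame; raw literals mimic literal \xDD text in a file.
df = pd.DataFrame({"Tweet": [
    r"Ya bani taplak dkk \xf0\x9f\x98\x84\xf0\x9f\x98\x84",
    r"Cerita silat lae \xf0\x9f\x98\x80 semacam Kho Ping Hoo",
]})

# regex=True is spelled out so the pattern is not treated as a literal string.
df["Tweet"] = df["Tweet"].str.replace(
    r"\s*(?:\\x[A-Fa-f0-9]{2})+", "", regex=True
)

cleaned = df["Tweet"].tolist()
print(cleaned)
# ['Ya bani taplak dkk', 'Cerita silat lae semacam Kho Ping Hoo']
```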
