Converting "UTF-8" characters to ASCII in a String?

Question

The CSV file has four columns: tweet_id, created_at, tweet_text, tweet_media_url.

The tweet_text column is already UTF-8 encoded.

import csv
f = open('tweets.csv')
csv_f = csv.reader(f)
#==============================================================================
tweet_text= []
for row in csv_f:
    tweet_text.append(row[2])
#==============================================================================
def deEmojify(inputString):
    inputString= inputString.encode('ascii', 'ignore').decode('ascii')
    return inputString
#===============================================================================
text1="b'@JWSpry Have some fun with this! \xf0\x9f\x98\x82 I can only post four at a time - a few more are coming."
text2=deEmojify(text1)
print(text2)

output - b'@JWSpry Have some fun with this! I can only post four at a time - a few more are coming.

print(tweet_text[7])

output -b'@JWSpry Have some fun with this! \xf0\x9f\x98\x82 I can only post four at a time - a few more are coming.

text3=deEmojify(tweet_text[7])
print(text3)

output -b'@JWSpry Have some fun with this! \xf0\x9f\x98\x82 I can only post four at a time - a few more are coming.

Why does the code work for text1 (which I copied and pasted from the CSV) but not for tweet_text[7]?

Can you clarify what your question is? See [ask], [help/on-topic]. — AMC, Apr 02 '20 at 19:35
Well: you could 1) detect them 2) remove them (just cut out 4 bytes in this case) — wildplasser, Apr 02 '20 at 19:53
Hi this might answer your question https://stackoverflow.com/a/36217640 — bmcculley, Apr 02 '20 at 20:11
This question is a bit confusing, as you've got some code (the CSV handling stuff) that doesn't seem related at all to the things you're asking about (the emoji filtering stuff). Can you better separate out your example from what I'm guessing is your real code? A further issue is that your example string (`text1`) appears to be the string representation of a `bytes` object. That's not a very good format to be using, if you can avoid it. If you can fix whatever is generating your CSV (or wherever that string comes from) to be properly formatted, you'd have a much easier time working with it. — Blckknght, Apr 02 '20 at 20:32
`b'\xf0\x9f\x98\x82'.decode('utf-8')` returns `'😂'` _FACE WITH TEARS OF JOY_. So `print(tweet_text[7].decode('utf-8'))` could help. — JosefZ, Jan 31 '21 at 22:11
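
Building on the comments by Blckknght and JosefZ, here is a minimal sketch of what is probably going on, assuming the CSV cell holds the printed form of a `bytes` object with literal backslashes. The `cell` and `pasted` values are shortened, hypothetical stand-ins for the real row (with a closing quote added so they parse), and `de_emojify` is just a renamed copy of the function in the question. Pasting the text into a normal string literal turns each `\xNN` into one non-ASCII character, which `encode('ascii', 'ignore')` drops. The string read from the file is already pure ASCII, so nothing is removed.

```python
import ast

def de_emojify(text):
    # Same approach as in the question: drop every non-ASCII character.
    return text.encode('ascii', 'ignore').decode('ascii')

# Hypothetical cell as csv.reader would return it. The raw literal keeps the
# backslashes as ordinary characters, so the string is already pure ASCII.
cell = r"b'Have some fun with this! \xf0\x9f\x98\x82 I can only post four at a time.'"

# The same text pasted into a normal literal: Python interprets each \xNN
# escape as a single character above U+007F.
pasted = "b'Have some fun with this! \xf0\x9f\x98\x82 I can only post four at a time.'"

unchanged = de_emojify(cell)    # nothing to strip, comes back identical
stripped = de_emojify(pasted)   # the four escaped characters are removed

# One possible fix (an assumption about the data, not a general rule):
# parse the bytes literal, decode it as UTF-8, then filter.
raw = ast.literal_eval(cell)    # bytes object
decoded = raw.decode('utf-8')   # real text containing the emoji
clean = de_emojify(decoded)

print(unchanged == cell)
print(stripped)
print(clean)
```

`ast.literal_eval` only works if every cell is a well-formed bytes literal. If the CSV can be regenerated, writing the decoded text in the first place avoids this step entirely.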
