I have this file with some lines that contain some unicode literals like: "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
I want to remove those xe2\x80\x99 like characters.
I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.
SOLUTIONS TRIED 1.Regex 2.Decoding and Encoding 3.Lambda
- Regex Solution
line = "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
- LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
- ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)
HOW FILE WAS READ
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
print(stripped(row['text']))
print(re.sub(r'[^\x00-\x7f]',r'', row['text']))
print(row['text'].encode('ascii', 'ignore')).decode("utf-8"))
SUGGESTED METHOD
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
en = row['text'].encode()
print(type(en))
newline = en.decode('utf-8')
print(type(newline))
print(repr(newline))
print(newline.encode('ascii', 'ignore'))
print(newline.encode('ascii', 'replace'))