
I have this file with some lines that contain some unicode literals like: "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."

I want to remove those \xe2\x80\x99-like characters.

I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.

SOLUTIONS TRIED: 1. Regex 2. Lambda 3. Decoding and Encoding

  1. Regex Solution
import re

line =  "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
  2. LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
  3. ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)

HOW FILE WAS READ

df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
    print(stripped(row['text']))
    print(re.sub(r'[^\x00-\x7f]',r'', row['text']))
    print(row['text'].encode('ascii', 'ignore')).decode("utf-8"))

SUGGESTED METHOD

df = pandas.read_csv('file.csv',encoding = "utf-8")

for index, row in df.iterrows():
    en = row['text'].encode()
    print(type(en))
    newline = en.decode('utf-8')
    print(type(newline))
    print(repr(newline))
    print(newline.encode('ascii', 'ignore'))
    print(newline.encode('ascii', 'replace')) 
The Lop
  • Your supposed binary string `line = "b'Who\xe2\x80\x99s he?..."` is actually not one. Change it to `line = b'Who\xe2\x80\x99s he?...'`. – Finomnis Jun 12 '19 at 14:52
  • The brackets in the last row of your 'How file was read' section don't match up – Finomnis Jun 12 '19 at 21:44

1 Answer


Your byte string is valid UTF-8, so it can be decoded directly into a Python string.

You can then encode that string to ASCII with str.encode(). Passing 'ignore' as the error handler drops every non-ASCII character.

Passing 'replace' instead substitutes a '?' for each one.

line_raw = b'Who\xe2\x80\x99s he?'

line = line_raw.decode('utf-8')
print(repr(line))

print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))

Output:

'Who’s he?'
b'Whos he?'
b'Who?s he?'

To come back to your original question, your 3rd method was correct. It was just in the wrong order: the bytes have to be decoded first and then encoded to ASCII.

code3 = line_raw.decode("utf-8").encode('ascii', 'ignore')
print(code3)

To finally provide a working pandas example, here you go:

import pandas

df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(row['text'].encode('ascii', 'ignore'))

There is no need to call decode('utf-8'), because pandas already does that for you when it reads the file.
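To illustrate that point without requiring pandas, here is a minimal sketch that swaps in the standard-library csv module as a stand-in for pandas.read_csv. The helper name read_text_column and the sample file contents are made up for this example. It only shows the general idea: a reader that is given encoding="utf-8" hands back ordinary str values, which can be encoded straight away.

```python
import csv
import os
import tempfile


def read_text_column(path):
    """Return the 'text' column of a UTF-8 CSV file as a list of str."""
    # Hypothetical helper: the csv module stands in for pandas.read_csv here.
    with open(path, newline="", encoding="utf-8") as handle:
        return [row["text"] for row in csv.DictReader(handle)]


# Write a small sample file containing raw UTF-8 bytes (assumed contents).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "wb") as handle:
        handle.write(b"text\nWho\xe2\x80\x99s he?\n")
    values = read_text_column(path)

# The values are already decoded text, so only encode() is needed.
print(type(values[0]).__name__)
print(values[0].encode("ascii", "ignore"))
```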

Finally, if you have a Python string that contains non-ASCII characters, you can strip them like this:

text = row['text'].encode('ascii', 'ignore').decode('ascii')

This encodes the text to ASCII bytes, dropping every character that ASCII cannot represent, and then decodes the result back to text.
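The same round trip can be wrapped in a small function. This is only a sketch: the name strip_non_ascii and the sample string are illustrative and do not come from the question.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be represented in ASCII."""
    # encode(..., 'ignore') discards the non-ASCII characters;
    # decode('ascii') turns the remaining bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii")


sample = "Who\u2019s he?"  # contains a right single quotation mark
cleaned = strip_non_ascii(sample)
print(cleaned)
```

If the column is a pandas Series of strings, something like df['text'].map(strip_non_ascii) should apply it to every row, assuming the column has no missing values.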

You should look up the difference between Python 3 strings and bytes. That should clear things up for you, I hope.
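As a rough sketch of that difference (the variable names are illustrative): a str holds text, a bytes object holds encoded data, and encode()/decode() convert between the two.

```python
text = "Who\u2019s he?"       # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# The single quotation mark character becomes three bytes in UTF-8.
print(data)

# str only has encode(), bytes only has decode().
print(hasattr(text, "decode"), hasattr(data, "encode"))

# Decoding the bytes gives back the original text.
print(data.decode("utf-8") == text)
```

This is why calling decode() on a value that pandas has already returned as str raises an error saying that str has no such method.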

Finomnis
  • Your code works as is but still doesn't work after reading the data from a file. – The Lop Jun 12 '19 at 16:26
  • I'm pretty sure you don't connect it correctly to the file reading, then. I could help you if you would show me your code? – Finomnis Jun 12 '19 at 21:43
  • I added the code above under the suggested method title. You will see that I encoded the text first before decoding. If I try to decode first, I get an error saying str has no method decode. Encoding first converts the string to the bytes class. Thanks for all your help so far. – The Lop Jun 13 '19 at 05:22
  • Still no, why would you do `en = row['text'].encode()` and then `newline = en.decode('utf-8')`? `encode()` converts from text to utf-8 binary, and `decode('utf-8')` converts from utf-8 binary to text. You can combine those two lines to simply do `newline = row['text']` – Finomnis Jun 13 '19 at 06:42
  • Default strings, written as `'abcd'`, are Python's internal representation of text and can hold *all* characters that exist. That is what pandas returns: your `row['text']` is already in Python's internal representation, and all unicode characters are correct. If you then want to store that string as ascii, you need to strip out all non-ascii characters, which you do with `ascii_bytes = row['text'].encode('ascii', 'ignore')`. You then have a bytes object with ascii encoding, commonly displayed as `b'text'`. The `b'` indicates that it is a bytes-string. – Finomnis Jun 13 '19 at 06:45
  • You then have to look if your csv library / file access needs bytes or python strings as inputs, depending on that you can either directly use `ascii_bytes`, or you have to convert it back to python string with `ascii_bytes.decode('ascii')` – Finomnis Jun 13 '19 at 06:46
  • Nonetheless, you are getting the correct idea now I think :) Feel free to mark my answer as correct if you are satisfied :) – Finomnis Jun 13 '19 at 06:56