I have a 1 GB CSV file that I cannot read. I get the same kind of error with both plain Python and pandas. The file has only a single column and all of its values are numbers, so the problem is not caused by rows with more than one column.

with open("/Users/kiya/sep_sent.csv", encoding='utf-8') as f:
    for i in f:
        print(i.strip())

Another method:

with open("/Users/kiya/sep_sent.csv", encoding='cp1252') as f:
    for i in f:
        print(i.strip())

Traceback (most recent call last):
  File "/Users/kiya/test8.py", line 5, in <module>
    for i in f:
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 159: character maps to <undefined>

Pandas code:

import pandas as pd
df = pd.read_csv("/Users/kiya/sep_sent.csv", encoding="utf-8")
print(df)

My CSV values look like this:

0
0
0
....
5294751024

Error with the utf-8 version:

0
0
0
0
0
Traceback (most recent call last):
  File "/Users/kiya//test8.py", line 4, in <module>
    for i in f:
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 52: invalid start byte
warezers

2 Answers

0

You can pass the encoding argument to read_csv as well:

df = pd.read_csv("/Users/kiya/sep_sent.csv", encoding="utf-8")
  • It shows the same error. – warezers Oct 03 '18 at 13:30
  • Looks like you have non-Unicode characters in your text. See @Goyo's answer in comments above. To find out where the issue is in the file, you might want to consider splitting the file into smaller chunks to identify which chunk the problem is in. – Arjun Venkatraman Oct 13 '18 at 05:32
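One way to narrow down where the problem bytes sit, without splitting the file by hand, is to read it in binary mode and try to decode each line separately. The following is only a minimal sketch: the helper name `find_undecodable_lines` and its `limit` parameter are made up for illustration and are not part of pandas or the standard library.

```python
def find_undecodable_lines(path, encoding="utf-8", limit=10):
    """Return up to `limit` (line_number, raw_bytes) pairs that fail to decode.

    Hypothetical helper for illustration only.
    """
    bad = []
    # Binary mode: iterating yields raw bytes, so nothing is decoded implicitly.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((lineno, raw))
                if len(bad) >= limit:
                    break  # stop early; a 1 GB file may contain many bad lines
    return bad
```

Calling `find_undecodable_lines("/Users/kiya/sep_sent.csv")` and printing the returned pairs should show the raw bytes of the first offending lines, which may help in guessing the file's real encoding or in spotting corrupted rows.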
-1

Open the file with utf-8 encoding and it should work:

with open("/Users/kiya/sep_sent.csv", encoding='utf-8') as f:
    for i in f:
        print(i.strip())
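If the file contains bytes that are not valid UTF-8, the loop can still stop with a UnicodeDecodeError. A possible workaround, assuming it is acceptable to lose the offending bytes, is the `errors="replace"` argument of the built-in `open`, which substitutes U+FFFD for anything that cannot be decoded. The function name `read_lines_lenient` below is hypothetical and only serves the sketch.

```python
def read_lines_lenient(path, encoding="utf-8"):
    """Yield stripped lines, replacing undecodable bytes with U+FFFD.

    Hypothetical helper for illustration only.
    """
    # errors="replace" tells the decoder to substitute instead of raising.
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            yield line.strip()
```

Lines that come back containing "\ufffd" are the ones that held invalid bytes, so they can be logged or skipped before converting the rest to numbers.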
return42