0

My code is below. I am using PyCharm as my IDE, and the CSV file I'm using is from MS Excess. I've encoded the CSV as UTF-8 and am trying to read it with pandas. I want to be able to distinguish between objects and ints when I call df.info(), which is also why I didn't change the encoding to 'latin-1' or 'ISO...'.

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
plt.style.use('fivethirtyeight')  
cols = ['sentiment','id','date','query_string','user','text']  
df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
                 names=cols, encoding='utf-8')  #low_memory=False dtype='unicode' encoding='latin1'
df.head()  
df.info()  
df.sentiment.value_counts()

How do I fix the "can't decode bytes in position xxxx to xxxx" error? The full output looks like this:

"C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\Scripts\python.exe" 
"C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py"   
Traceback (most recent call last):
  File "C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py", line 6, in <module>
    df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
names=cols, encoding='utf-8')#low_memory=False dtype='unicode' encoding='latin1'
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 51845-51846: invalid continuation byte

Process finished with exit code 1

3 Answers

0

Your file is not actually UTF-8 encoded, even though you pass encoding='utf-8' to read_csv. Use another encoding to solve the problem, such as 'latin' or 'ISO-8859-1' (two names for the same codec). I refer you to this link for help.
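
Here is a minimal sketch of that suggestion, assuming the file really is Latin-1. The one-row sample, the temporary file, and the helper name read_tweets are made up for illustration; only the column names and the read_csv arguments come from the question. It also speaks to the df.info() concern: the encoding argument only controls how bytes are decoded into text, while the column types are inferred from the parsed values, so the numeric columns should still come out as integers.

```python
import os
import tempfile

import pandas as pd

cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']


def read_tweets(path, encoding='latin-1'):
    """Hypothetical helper: read the headerless CSV with an explicit encoding."""
    return pd.read_csv(path, header=None, names=cols, encoding=encoding)


# Made-up sample row: byte 0xE9 is 'é' in Latin-1 but is not valid UTF-8 here.
sample = b'0,1467810369,Mon Apr 06 2009,NO_QUERY,someuser,"caf\xe9 time"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'wb') as fh:
        fh.write(sample)
    df = read_tweets(path)

# Column types are inferred from the parsed values, not from the encoding.
print(df.dtypes)
```

If the text comes out with odd characters, the file is probably in some other single-byte encoding (for example cp1252), and you would pass that name instead.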

Worst case, if none of this works, you can open the file in 'rb' mode (open(file, 'rb')) and parse it yourself by splitting each line on the CSV delimiter.
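
A rough sketch of that fallback, using only the standard library. Note that it swaps in the csv module rather than a plain split on the delimiter, because the text column may contain quoted commas. The function name read_rows and the sample bytes are assumptions for illustration. Undecodable bytes are replaced with U+FFFD instead of raising, so you lose those characters but keep the rows.

```python
import csv
import io
import os
import tempfile


def read_rows(path, encoding='utf-8'):
    """Hypothetical helper: read raw bytes, decode leniently, split with csv."""
    with open(path, 'rb') as fh:
        raw = fh.read()
    # errors='replace' turns each undecodable byte into U+FFFD instead of raising.
    text = raw.decode(encoding, errors='replace')
    return list(csv.reader(io.StringIO(text, newline='')))


# Made-up sample: 0xE9 is invalid UTF-8 here, and the last field has a quoted comma.
sample = b'0,1,NO_QUERY,someuser,"caf\xe9, time"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'wb') as fh:
        fh.write(sample)
    rows = read_rows(path)

print(rows)
```

This reads the whole file into memory at once, which may or may not be acceptable for a file with 1.6 million rows. The resulting lists of strings can be handed to pd.DataFrame(rows, columns=cols), but then every column is text and the numeric ones need converting afterwards.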

mjrezaee
0

I was having the same problem, but in my case the solution was really easy. My IDE is PyCharm 2020.1 and the .csv has ISO-8859-1 encoding. I tried everything without luck, so I decided to check my IDE config. I went to:

  1. File
  2. Settings
  3. Left column: Editor
  4. In Editor: File Encodings

Then I added my .csv file with the + button on the right side, and finally changed the IDE's config. I changed everything to ISO-8859-1, because the default was UTF-8, and used the exact character to work with the file, which in my case is: ?. Hope this works.
0

It's better to save that CSV as an .xlsx file and read it with

pd.read_excel