Why does a UnicodeDecodeError message appear when I load a dataset?

Question

I have converted an Excel file to CSV so that I can analyse the dataset with Python. After importing my modules, I loaded the dataset with this code:

import pandas as pd
import numpy as np
import matplotlib as mlt

pd.read_csv('filename.csv')

I got the following message:

"'utf-8' codec can't decode byte 0xbf in position 6: invalid start byte"

I searched the web, but none of the solutions I found applied to my issue, and to be honest I don't know what to do.

You need to look into encoding. https://stackoverflow.com/questions/48067514/utf-8-codec-cant-decode-byte-0xa0-in-position-4276-invalid-start-byte — Vaishali, Jan 14 '19 at 23:37
Can you share filename.csv? Probably your file has another encoding. — Andre Araujo, Jan 15 '19 at 02:47

Andre Araujo · Accepted Answer · 2019-01-16T01:48:21.970

First, you need to know which character encoding your file really uses. It's not UTF-8.

There are lots of different character encodings, and Excel sometimes changes the encoding to 'iso-8859-1' or 'cp1252', which is crazy.
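To make the error message concrete, here is a minimal sketch using only the byte named in the question's traceback (0xbf). It is an illustration, not a diagnosis of the actual file: it shows that this byte cannot start a UTF-8 sequence, yet is an ordinary character in the single-byte encodings mentioned above.

```python
# The byte reported in the error message, taken on its own.
raw = b"\xbf"

utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xbf is only valid as a continuation byte in UTF-8, never as a first byte.
    utf8_error = str(exc)
    print(utf8_error)

# In cp1252 and latin1 every character is one byte, and 0xbf maps to U+00BF.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")
print(repr(as_cp1252), repr(as_latin1))
```

So the same bytes can be garbage under one encoding and readable text under another, which is why the encoding has to be stated when the file is read.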

Here is some important background that every IT person must know: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

To solve your problem, there are at least three options:

1) Try some of the possibilities (latin1, cp1252, etc.):

df = pd.read_csv('file.csv', encoding='latin1')

2) Save your file with UTF-8 encoding (or in its original encoding) before reading it. Windows probably changed the encoding after you opened the file in Excel and updated some rows.

3) One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess:

import chardet
import pandas as pd

# look at the first ten thousand bytes to guess the character encoding
with open('file.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.99, 'language': ''}

# read in the file with the encoding detected by chardet
df = pd.read_csv('file.csv', encoding='Windows-1252')
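Options 1 and 2 can also be sketched with the standard library alone. The helper names below (first_working_encoding, resave_as_utf8) and the sample file are hypothetical, made up for this illustration. Note the assumption behind the approach: a decode that succeeds does not prove the encoding is right, and latin1 accepts every possible byte, so it only makes sense as the last candidate.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: it reads the entire file into memory, so it is
    only a sketch for reasonably small CSV files.
    """
    with open(path, "rb") as f:
        data = f.read()
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return enc
    return None


def resave_as_utf8(src, dst, src_encoding):
    """Rewrite src (in src_encoding) as a UTF-8 file at dst (option 2)."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


# Demo on a made-up file written the way Excel on Windows might write it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.csv")
    dst = os.path.join(tmp, "sample_utf8.csv")
    with open(src, "wb") as f:
        f.write("name,city\nJosé,Málaga\n".encode("cp1252"))

    enc = first_working_encoding(src)
    resave_as_utf8(src, dst, enc)
    with open(dst, encoding="utf-8") as f:
        roundtrip = f.read()

print(enc)
```

The returned name can be passed straight to pandas, as in pd.read_csv('file.csv', encoding=enc), or the re-saved UTF-8 copy can be read without any encoding argument. Check a few rows with accented characters afterwards, since a wrong single-byte encoding fails silently.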
