I keep getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte

Question

I'm trying to read monthly CSV files, but for some reason I keep getting this error.

This is my code below.

df = pd.DataFrame()
 
for file in os.listdir("Performance_Data"):
    if file.endswith(".csv"):
        df = pd.concat([df , pd.read_csv(os.path.join("Performance_Data", file))], axis=0 )
        
df.head()

What do I do?

It may not be a UTF-8 encoded file. You can open it in `notepad++`; the encoding is shown at the bottom. Also make sure it is in fact a comma-delimited file and not tab- or |-delimited. If you see a different encoding, just pass `encoding='utf-16'` (or whatever it is) to read_csv. – Chris, Dec 16 '21 at 14:39

user23952 · Answer 1 · 2021-12-16T15:32:24.477


Pandas assumes by default that your file is encoded in UTF-8. Byte 0x92 is not a valid start byte in UTF-8, which suggests your file is encoded in Windows-1252 instead. You can tell Pandas to use this encoding with:

pd.read_csv(os.path.join("Performance_Data", file), encoding='cp1252')
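To see why the same byte fails under one codec and works under the other, here is a small standalone check. The sample bytes are made up for illustration and are not taken from your file.

```python
# Hypothetical sample: the byte 0x92 sits at position 1, as in the error message.
raw = b"I\x92m late"

# UTF-8 rejects 0x92 as a start byte, which reproduces the original error.
try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = ""
except UnicodeDecodeError as exc:
    utf8_ok = False
    utf8_error = str(exc)

# Windows-1252 maps 0x92 to the right single quotation mark (U+2019).
text = raw.decode("cp1252")

print(utf8_ok)
print(utf8_error)
print(text)
```

A curly apostrophe typed in Word or Excel and saved on Windows is a common source of this byte.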

Detecting the encoding of a file automatically is a bit tricky, but you can use a package called "chardet". For your code, it could look like this:

import os

import chardet
import pandas as pd

df = pd.DataFrame()

for file in os.listdir("Performance_Data"):
    if file.endswith(".csv"):
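        # os.listdir returns bare file names, so build the full path first.
        path = os.path.join("Performance_Data", file)
        # Read the raw bytes so chardet can guess the encoding.
        with open(path, "rb") as fp:
            encoding = chardet.detect(fp.read())["encoding"]
        df = pd.concat([df, pd.read_csv(path, encoding=encoding)], axis=0)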

df.head()
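If you would rather not install chardet, another option is to try a short list of candidate encodings and keep the first one that decodes the whole file. This is only a sketch: the helper name `first_working_encoding` and the candidate list are my own assumptions, and the sample file is made up. It also shows why the 'charmap' error from the comments can happen: byte 0x9d has no character assigned in Python's cp1252 codec.

```python
import os
import tempfile


def first_working_encoding(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the candidate list is an assumption, so adjust it
    to the encodings you actually expect.
    """
    with open(path, "rb") as fp:
        raw = fp.read()
    for enc in encodings:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the file, try the next one
        return enc
    raise ValueError(f"none of {encodings} can decode {path}")


# Made-up sample containing 0x92 (invalid in UTF-8) and 0x9d (unassigned in cp1252).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "wb") as fp:
        fp.write(b"name,note\nA,it\x92s fine\nB,odd\x9dbyte\n")
    chosen = first_working_encoding(sample)

print(chosen)
```

You would then pass the result to `pd.read_csv(path, encoding=chosen)`. Keep in mind that latin-1 accepts every byte, so it never raises an error but may show the wrong characters if the file is really in some other encoding.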

References

Pandas read_csv documentation.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte, a relevant earlier question on Stack Overflow.

edited Dec 16 '21 at 15:32

answered Dec 16 '21 at 14:46

user23952


Thanks for this. The code ran, but I got this error as a follow-up: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 253698: character maps to <undefined> – David Dec 16 '21 at 14:51
Maybe your file is not Windows-1252 after all. Could you check it on https://www.freeformatter.com/convert-file-encoding.html? – user23952 Dec 16 '21 at 14:57
Not working as my file is bigger than 2MB – David Dec 16 '21 at 15:14
Could work with `chardet`, although a bit hacky. Edited the answer. – user23952 Dec 16 '21 at 15:27
How do I do that? – David Dec 16 '21 at 16:33
Run `pip install chardet`, and use the code above. Post the error message if it doesn't work :) – user23952 Dec 16 '21 at 16:42
