
I am trying to use pandas to read a CSV file which is in a subfolder of the current folder. I am on a Windows PC.

If I run:

df=pd.read_csv("subfolder//file.csv") 

I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 16: invalid start byte

If I run:

df=pd.read_csv("subfolder//file.csv", engine='python')

It works.

  • Why?

  • Isn't there a way to use C as the engine? It's meant to be faster.

Pythonista anonymous
  • Could your CSV file contain a SUPERSCRIPT TWO character U+00B2 `²`? If the answer is yes, it is probably Latin1 or cp1252 encoded... – Serge Ballesta Mar 20 '19 at 11:40
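The point in the comment above can be checked without pandas. This is a minimal sketch that assumes nothing about the actual file beyond the byte 0xb2 reported in the traceback; the variable names are illustrative only.

```python
# The single byte reported in the traceback.
raw = b"\xb2"

# In UTF-8, 0xB2 is a continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 and cp1252 both map 0xB2 to U+00B2 (SUPERSCRIPT TWO).
as_latin1 = raw.decode("latin1")
as_cp1252 = raw.decode("cp1252")

print(utf8_ok, as_latin1, as_cp1252)
```

If the file really does contain `²` at that position, this is consistent with a Latin-1 or cp1252 file being read as UTF-8.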

1 Answer


This might be because read_csv is trying to decode the file as UTF-8 while your file is in a different encoding. To detect the encoding on Windows, you can look at this: Get encoding of a file in Windows

Once you have found out the file's encoding, you can pass it to the read_csv function through the encoding argument, e.g.

df=pd.read_csv("subfolder//file.csv", encoding="latin1")
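If no encoding-detection tool is at hand, one rough approach is to try a short list of likely encodings and keep the first one that decodes the whole file. The sketch below uses only the standard library; `first_working_encoding`, the candidate list, and the sample file contents are all hypothetical and chosen for illustration. A successful decode does not prove the guess is right, and `latin1` never fails, so it only makes sense as the last fallback.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: a crude guess, not real encoding detection.
    """
    with open(path, "rb") as f:
        data = f.read()
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Build a small sample file containing the byte 0xB2 (made-up contents).
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("name,area\nroom,12 m\u00b2\n".encode("latin1"))
    sample_path = tmp.name

detected = first_working_encoding(sample_path)
os.remove(sample_path)
print(detected)
```

The returned name can then be passed as the `encoding` argument of `read_csv`. Putting `utf-8` first is deliberate: it is strict enough that a file which decodes cleanly as UTF-8 is very likely UTF-8.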
Farhood ET
  • So this means that engine='c' causes encoding to default to 'utf-8', while engine='python' means a different encoding? I double checked https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and none of this seems to be documented explicitly - as is unfortunately all too common in the beautiful world of Python... – Pythonista anonymous Mar 20 '19 at 15:33
  • @Pythonistaanonymous I don't know about the current situation explicitly, but the error you are getting is an error of encoding conflict, and I suspect this might be the case. Have you checked your file's encoding yet? – Farhood ET Mar 20 '19 at 17:50
  • Yes, if I set encoding='latin1' it works. Thanks for the help. PS still frustrated at how much documentation sucks in the world of Python! – Pythonista anonymous Mar 20 '19 at 17:52