Pandas read_csv filepath with special characters codec can't decode

Question

I am using Python version 3.5.3 and Pandas version 0.20.1

I use read_csv to read in CSV files. I pass it an open file object, as suggested in this post (I prefer this over the solution using sys._enablelegacywindowsfsencoding()). The following code works:

import pandas as pd

with open("C:/Desktop/folder/myfile.csv") as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")

However, when the file path contains a special character such as Ä, as in the following:

import pandas as pd

with open("C:/Desktop/folderÄ/myfile.csv") as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")

Python displays an error message: (unicode error) 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data.

I also tried adding an r prefix to the file path, but I get the same error message, except that the reported position is now an integer that is exactly where the special character sits in the file path.

So the cause is the special character in the file path.

(This is not a decode error that can be solved by passing encoding="utf-8" or another encoding such as ISO-8859-1. To be absolutely sure, I tried the following encodings and always got the same error message: utf-8, ISO-8859-1, cp1252.)

Mark Tolonen · Answer 1 · 2020-11-26T14:36:40.873

The error indicates your source file (not the data file) is not encoded in UTF-8. In Python 3, your source file must either be saved in UTF-8 encoding, or you must declare the encoding that the source file is saved in with a special comment, e.g. #coding=Windows-1252 at the top of the file. \xc4 is the Windows-1252 encoding of Ä and is the default encoding for Western European and US Windows, so it's a good guess. Ideally, re-save your source in UTF-8.
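The byte-level claim above can be checked in isolation. The following is only a small sketch using standard str/bytes methods; the variable names are illustrative, and the letter Ä is written as the escape \u00c4 so the snippet does not depend on how its own file is saved.

```python
raw = b"\xc4"  # the byte named in the error message

# Windows-1252 (cp1252) maps the single byte 0xC4 to the letter A-umlaut.
assert raw.decode("cp1252") == "\u00c4"

# UTF-8 encodes the same letter as two bytes instead of one.
utf8_bytes = "\u00c4".encode("utf-8")

# In UTF-8, 0xC4 only starts a two-byte sequence, so a lone 0xC4 cannot be decoded.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(utf8_bytes, reason)
```

This is why a source file saved as Windows-1252 but read as UTF-8 fails at the Ä in the path literal, before read_csv is ever called.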

For example, if the source is Windows-1252-encoded and the data file is GB2312-encoded (Chinese):

#coding=Windows-1252                         # encoding of source file
import pandas as pd
with open('DÄTÄ.csv',encoding='gb2312') as f:  # encoding of data file
    data = pd.read_csv(f)

Note that source files default to UTF-8 encoding, but open defaults to the encoding returned by locale.getpreferredencoding(False). Since that varies with OS and configuration, it is best to always specify the encoding when opening files.
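As a rough illustration of that last point, the sketch below prints the encoding open would fall back to and then reads a file with the encoding stated explicitly. The helper read_text, the file name data.csv and its contents are made up for the example; only the standard library is used, and the resulting file object could equally be handed to pd.read_csv as in the code above.

```python
import locale
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file with an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding, newline="") as f:
        return f.read()


# What open() uses when no encoding argument is given; varies by OS and settings.
default_encoding = locale.getpreferredencoding(False)
print("open() default on this machine:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")  # assumed name, for illustration only
    # Write the sample data as UTF-8; \u00c4 is the letter A-umlaut.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("name;value\n\u00c4pfel;1\n")
    # Read it back with the same encoding stated explicitly.
    text = read_text(path, "utf-8")

print(text)
```

Stating the encoding on both the write and the read keeps the result the same regardless of what default_encoding happens to be.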

The file was exported as UTF-8. I also opened it and re-saved it as UTF-8 with the text editor. As I said, it works when the file path contains no "Ä". I tried again with an "Ä" in the file path and got the same error message. I tried your code, except that instead of gb2312 I used utf-8, utf8-sig and latin; first with your f and datafile, then with my fp. So I am testing with exactly the same file: with a file path without "Ä" it works, and with an "Ä" in the file path I get the error message. – BertHobe, Nov 26 '20 at 07:39

score -1 · Answer 2 · answered Nov 25 '20 at 17:17

Try using Unicode file paths of the form u'path/to/files', for example:

import pandas as pd

with open(u'C:/Desktop/folderÄ/myfile.csv') as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")

Paul Brennan

The OP is using Python 3. The strings are already Unicode. – Mark Tolonen Nov 26 '20 at 03:22
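A quick way to confirm this comment: in Python 3 the u prefix is accepted but changes nothing, so it cannot affect how the path is handled. A minimal check, with the variable names chosen for illustration and Ä written as \u00c4 to keep the snippet ASCII-only:

```python
plain = "C:/Desktop/folder\u00c4/myfile.csv"
prefixed = u"C:/Desktop/folder\u00c4/myfile.csv"

# Both literals produce the same type and the same value in Python 3.
same_type = type(plain) is str and type(prefixed) is str
same_value = plain == prefixed

print(same_type, same_value)
```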
