
I am using Python 3.5.3 and pandas 0.20.1.

I use read_csv to read CSV files. Following this post, I pass an open file object instead of a path (I prefer this to the solution that uses _enablelegacywindowsfsencoding()). The following code works:

import pandas as pd

with open("C:/Desktop/folder/myfile.csv") as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")

However, when the path contains a special character such as Ä, as follows:

import pandas as pd

with open("C:/Desktop/folderÄ/myfile.csv") as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")

Python displays an error message: (unicode error) 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data.

I also tried adding an r prefix to the path string, but I get the same error message, except that the reported position is now an integer that matches exactly where the special character sits in the path.

So the cause is the special character in the file path.

(It is not a decoding error that can be solved with encoding="utf-8" or another encoding such as ISO-8859-1. To be sure, I tried utf-8, ISO-8859-1 and cp1252 and always got the same error message.)

BertHobe

2 Answers


The error indicates that your source file (not the data file) is not encoded in UTF-8. In Python 3, the source file must either be saved as UTF-8 or declare its actual encoding with a special comment at the top of the file, e.g. #coding=Windows-1252. \xc4 is the Windows-1252 encoding of Ä, and Windows-1252 is the default encoding on Western European and US Windows, so it's a good guess. Ideally, re-save your source in UTF-8.

For example, if the source is Windows-1252-encoded and the data file is GB2312-encoded (Chinese):

#coding=Windows-1252                         # encoding of source file
import pandas as pd
with open('DÄTÄ.csv',encoding='gb2312') as f:  # encoding of data file
    data = pd.read_csv(f)
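
The byte named in the error message can be reproduced in isolation. The following is a minimal sketch, not part of the original answer: it encodes Ä both ways and then decodes the Windows-1252 byte as UTF-8, which fails with the same wording. The last line uses the hypothetical path from the question, written with a `\u00c4` escape so that the source stays pure ASCII and its saved encoding no longer matters.

```python
# "\u00c4" is the escape for the character Ä; the snippet itself is pure ASCII.
char = "\u00c4"

cp1252_bytes = char.encode("cp1252")  # one byte: 0xc4
utf8_bytes = char.encode("utf-8")     # two bytes: 0xc3 0x84

# Decoding the lone Windows-1252 byte as UTF-8 fails, just as it does when
# Python reads a source file that was saved as Windows-1252.
try:
    cp1252_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(cp1252_bytes, utf8_bytes)
print(error_message)

# Assumption: the path from the question, spelled with an escape instead of
# a literal Ä. The resulting string is identical at run time.
path = "C:/Desktop/folder\u00c4/myfile.csv"
```

Note that the escape is only interpreted in a normal string literal; in an r-prefixed (raw) literal the backslash sequence is kept as written.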

Note that source files default to UTF-8 encoding, but open defaults to the encoding returned by locale.getpreferredencoding(False). Since that varies with OS and configuration, it is best to always specify the encoding when opening files.
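
As a rough illustration of that advice, the sketch below prints the platform default and then reads a semicolon-separated file with an explicit encoding from a folder whose name contains Ä. The temporary directory, the file contents and the use of the standard csv module (instead of pandas, to keep the sketch dependency-free) are assumptions made for the example; the same open file object could presumably be handed to pd.read_csv(fp, sep=";") instead.

```python
import csv
import locale
import os
import tempfile

# Platform-dependent default that open() falls back to when no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print("open() default:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical folder name containing Ä, written as an escape.
    folder = os.path.join(tmp, "folder\u00c4")
    os.mkdir(folder)
    path = os.path.join(folder, "myfile.csv")

    # Write a small Latin-1 encoded sample file (made-up data).
    with open(path, "w", encoding="latin-1", newline="") as fp:
        fp.write("name;value\nK\u00e4se;1\n")

    # Read it back, stating the data file's encoding explicitly on open().
    with open(path, encoding="latin-1", newline="") as fp:
        rows = list(csv.reader(fp, delimiter=";"))

print(rows)
```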

Mark Tolonen
  • The file was exported as UTF-8. I opened it and re-saved it with the text editor as UTF-8. As I said, it works when the file path contains no "Ä". With an "Ä" in the path I get the same error message. I tried your code, except that instead of gb2312 I used utf-8, utf8-sig and latin, first with your names f and data and then with my fp. So I tested with exactly the same file: without an "Ä" in the path it works, and with an "Ä" in the path I get the error message. – BertHobe Nov 26 '20 at 07:39

Try using a Unicode string literal for the file path (u'path/to/files'), for example:

import pandas as pd

with open(u'C:/Desktop/folderÄ/myfile.csv') as fp:
    df=pd.read_csv(fp, sep=";", encoding ="latin")
Paul Brennan