
I'm trying to load a CSV with pandas, but am running into a problem if the file name has accents. It's clearly an encoding problem, but although read_csv lets you set encoding for text within the file, I can't figure out how to encode the file name properly.

input_file = r'C:\...\Datasets\%s\Provinces\Points\%s.csv' % (country, province)
self.locs = pandas.read_csv(input_file,sep=',',skipinitialspace=True)

The CSV file is Anzoátegui.csv. When the error occurs, input_file is:

input_file = 'C:\\...\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv'

Error:

OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist

So maybe it's converting my string to bytes? I tried using io.StringIO(input_file) as well, which puts the correct file name as a column header on an empty DataFrame:

Empty DataFrame
Columns: [C:\PF2\QGIS Valmiera\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv]
Index: []
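
Both observations above can be reproduced with the standard library alone. The sketch below is only an illustration and does not call pandas: it shows that the bytes in the error message are exactly the UTF-8 encoding of the file name, which suggests the path string was encoded to UTF-8 before it reached the C parser, and that io.StringIO wraps the string as file contents rather than as a path, which is why the path ended up as a column header. The variable names are made up for the example.

```python
import io

# "Anzoátegui.csv", with á written as an escape so the sketch does not
# depend on how this source file is encoded.
file_name = "Anzo\u00e1tegui.csv"

# Encoding the name to UTF-8 gives the byte sequence shown in the OSError.
encoded = file_name.encode("utf-8")
print(encoded)  # b'Anzo\xc3\xa1tegui.csv'

# io.StringIO does not open anything on disk: the string itself becomes
# the text that a reader will see.
buffer = io.StringIO(file_name)
contents = buffer.read()
print(contents == file_name)  # True
```

So the StringIO attempt handed read_csv a one-line "file" whose only line was the path, and that line was parsed as the header row.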

Any ideas on how to get this file to load? Unfortunately I can't just strip out accents, as I have to interface with software that requires the proper name, and I have a ton of files to format (not just the one). Thanks!

Edit: Full error

Traceback (most recent call last):
  File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_comm.py", line 891, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
  File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_vars.py", line 486, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)
  File "<string>", line 1, in <module>
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 404, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 486, in __init__
    self._make_engine(self.engine)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 594, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 952, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 330, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3040)
  File "parser.pyx", line 557, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:5387)
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
NorthCat
khe
  • Can you compare the file name with what `os.listdir` returns? – Andy Hayden Jun 04 '14 at 17:59
  • The file names show up correctly in os.listdir: os.listdir(path='C:\...\Datasets\Venezuela\Provinces\Points') ['Amazonas.csv', 'Anzoátegui.csv', 'Apure.csv', 'Aragua.csv', 'Barinas.csv', 'Bolívar.csv', 'Carabobo.csv', 'Cojedes.csv', 'Delta Amacuro.csv', 'Distrito Capital.csv', 'Falcón.csv', 'Guárico.csv', 'Lara.csv', 'Miranda.csv', 'Monagas.csv', 'Mérida.csv', 'Nueva Esparta.csv', 'Portuguesa.csv', 'Sucre.csv', 'Trujillo.csv', 'Táchira.csv', 'Vargas.csv', 'Yaracuy.csv', 'Zulia.csv'] – khe Jun 04 '14 at 18:06
  • Hmm, and does `pd.read_csv(os.path.join(os.getcwd(), os.listdir()[1]))` work? – Andy Hayden Jun 04 '14 at 18:08
  • No, that does not work, and produces the same OSError as above: `OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist` – khe Jun 04 '14 at 18:22
  • I think the file name is being treated with the default Python encoding UTF-8 within read_csv, which can't handle accents. Trying to convert to latin-1. – khe Jun 04 '14 at 18:24
  • Fishy! I think this could be a bug. Do you mind posting this as an issue on GitHub? – Andy Hayden Jun 04 '14 at 18:44
  • Interestingly I have no idea how to do that - I'm pretty new to the open-source world. Do you mean on the [pandas Issues page](https://github.com/pydata/pandas/issues)? – khe Jun 04 '14 at 19:07
  • that's precisely what I mean :) Also, please include the entire stacktrace (both here and on github). Thanks! – Andy Hayden Jun 04 '14 at 19:12
  • Thanks for your help, I've added the rest of the error, and I'll try posting on github as well. – khe Jun 04 '14 at 19:19
  • I had no issues with read_csv on Python 2.x with any file names (containing accents, Cyrillic symbols and other Unicode characters), so I believe this is a Python 3.x bug. Added the relevant tag. – alko Jun 04 '14 at 20:44
  • It may be an issue with an old version of pandas (0.13.0 vs. 0.14.0). Working to resolve. – khe Jun 04 '14 at 20:47
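
A workaround sometimes used when a reader mishandles a non-ASCII path is to open the file in Python and hand over the open file object instead of the path string, since the built-in open() accepts Unicode paths. The sketch below is only an illustration under stated assumptions: it uses the standard csv module so that it runs without pandas, read_rows and demo are hypothetical helper names, and the UTF-8 content encoding is an assumption (the thread never says how the CSV contents are encoded). Whether this avoids the error on the affected pandas version was not tested in the thread.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Open the file in Python, then give the handle (not the path) to the parser."""
    with open(path, newline="", encoding=encoding) as handle:
        # With pandas available, the same handle could be passed as
        # pandas.read_csv(handle, sep=",", skipinitialspace=True).
        return list(csv.reader(handle, skipinitialspace=True))


def demo():
    """Write a small CSV with an accented file name and read it back."""
    with tempfile.TemporaryDirectory() as folder:
        # "Anzoátegui.csv", with á written as an escape.
        path = os.path.join(folder, "Anzo\u00e1tegui.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            handle.write("x, y\n1, 2\n")
        return read_rows(path)


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the path is resolved by Python's own file API, so the parser only ever sees an already-open stream.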

1 Answer


OK folks, I got a little lost in dependency hell, but it turns out that this issue was fixed in pandas 0.14.0. Upgrade to that version or later to get files with accented names to load correctly.
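
To see whether an installation is affected, the installed version string can be compared with 0.14.0. This is a minimal sketch: version_tuple and has_filename_fix are hypothetical helper names, and the comparison looks only at the leading numeric parts of the version, so suffixes such as "rc1" are ignored.

```python
import re

FIXED_IN = (0, 14, 0)  # the release named in the answer above


def version_tuple(version):
    """Turn '0.13.0' or '0.14.0rc1' into a 3-tuple of integers."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '0.14' so tuple comparison behaves.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_filename_fix(version):
    """Return True if the version is at least the release with the fix."""
    return version_tuple(version) >= FIXED_IN


# Typical use, assuming pandas is importable:
#     import pandas
#     print(pandas.__version__, has_filename_fix(pandas.__version__))
```

If the check reports an older version, upgrading through the usual package manager (for example `pip install --upgrade pandas`) is the straightforward route.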

Comments at github.

Thanks for the input!

khe