I am trying to read a CSV file containing Cyrillic characters with pandas' read_csv.

import pandas
data = pandas.read_csv('dataset.csv', delimiter=r'\|\|', engine='python', encoding='utf-8')
print type(data.name[0])

<type 'str'>

Here, I am expecting to get unicode as with

print type(u'hello')

<type 'unicode'>

What am I doing wrong?

com
  • Python is duck-typed. You should never ask for what type an object is. This being said, you need to provide some kind of example where you show what you have and what your desired output is. Your code appears correct AFAIK – firelynx Apr 04 '17 at 08:15
  • I don't know how pandas implements the `read_csv` method, but if it uses the std.lib. `csv` module, then the solution to this probably isn't trivial, because Python 2's `csv` doesn't support decoding files (which is quite sad, in fact). One more reason to switch to Python 3 now! – lenz Apr 04 '17 at 10:22

1 Answer

Short answer: Unicode is uncoded text; UTF-8 is one way of encoding Unicode characters as bytes. When pandas imports your UTF-8 encoded text, it converts it to the Python str type. In Python 3, str holds decoded text, so it is the same as the unicode type.
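To make the distinction between text and its UTF-8 encoding concrete, here is a minimal sketch that is not from the original post. It should run on both Python 2 and Python 3; the sample word and the variable names are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Illustrative sample; any Cyrillic string would do.
text = u'привет'              # decoded text: unicode on Python 2, str on Python 3
raw = text.encode('utf-8')    # encoded bytes: str on Python 2, bytes on Python 3

print(type(text), len(text))  # length counts code points
print(type(raw), len(raw))    # length counts bytes; each Cyrillic letter takes 2 in UTF-8

# Decoding the bytes recovers the original text.
assert raw.decode('utf-8') == text
```

On Python 2 the two printed types differ (unicode vs. str), which is why getting `<type 'str'>` back suggests the value is still encoded bytes.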

For a more in-depth understanding, see:

  1. UTF-8 vs Unicode
  2. Python str vs Unicode
oscarbranson
  • Thank you very much for the clarification. – com Apr 04 '17 at 04:57
  • The OP is apparently using Python 2 (see the print statement). In Python 2, *decoded* (I think that's what you mean by “uncoded”) text is of type `unicode`. So, apparently, pandas did **not** properly decode the input text. – lenz Apr 04 '17 at 07:52
  • Fair point... I'm not sure how this works in Python 2. Any ideas @lenz? – oscarbranson Apr 05 '17 at 03:47
  • Looking at [pandas' `read_csv` documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), I strongly suspect that they're using the std.lib. `csv` module (because of the keywords "quoting", "quotechar" etc.). This means that it needs a workaround in Python 2, such as decoding after parsing. – lenz Apr 05 '17 at 08:28
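The "decoding after parsing" workaround mentioned in the last comment could look roughly like the sketch below. This is only an illustration under assumptions: `to_text` is a hypothetical helper, the sample cells are made up, and nothing here has been checked against the OP's file or pandas version.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_text(value, encoding='utf-8'):
    """Return value as decoded text; non-byte values pass through unchanged."""
    # On Python 2, bytes is an alias of str, so this catches undecoded cells.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical parsed cells: UTF-8 bytes, already-decoded text, and a number.
cells = [u'имя'.encode('utf-8'), u'текст', 42]
decoded = [to_text(cell) for cell in cells]
print(decoded)
```

With pandas, the same helper could presumably be applied to a column after loading, for example `data['name'] = data['name'].map(to_text)`.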