Pandas dataframe and character encoding when reading excel file

Question

I am reading an excel file that has several numerical and categorical data. The columns name_string contains characters in a foreign language. When I try to see the content of the name_string column, I get the results I want, but the foreign characters (that are displayed correctly in the excel spreadsheet) are displayed with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

Producing the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?

score 16 · Accepted Answer · edited May 23 '17 at 12:25

Actually, the data is being parsed correctly into unicode, not strs. The u prefix indicate that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of the items in the sequence. So instead of seeing the printed version of the unicode, you see the repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner

The purpose of the repr is to provide an unambiguous string representation for each object. The printed verson of a unicode can be ambiguous because of invisible or unprintable characters.

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
   .....:    .....:    .....: 
In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]

Thank you very much @unutbu. Excellent answer and it clarified more than one fuzzy issue for me. Cheers — Luis Miguel, May 11 '14 at 16:25
How to save the same problem when we save the values into a list and we need to print the list. I'd like to see the right chars. — Sigur, Dec 23 '17 at 00:27
@Sigur: Printing a list causes Python to print brackets around the *repr* of the items in the list separated by commas. If you want the `str` of the items, you need to [compose that yourself](https://stackoverflow.com/a/32849250/190597). You might also need to decode bytes if the objects in your list are `bytes`, not (Python3) `str`s. If this explanation and the link do not fully answer your question, please open a new question with all the details (an example snippet of your list and the desired output). — unutbu, Dec 23 '17 at 01:38
@unutbu, thanks. The link is very useful. Off topic: everytime I read your nick I red as `ubuntu`. lol — Sigur, Dec 23 '17 at 11:33

Pandas dataframe and character encoding when reading excel file

1 Answers1

Linked