Trouble converting a pandas dataframe into a list with the right utf-8 encoding

Question

I'm trying to convert a pandas DataFrame into a list. The conversion works, but I have some issues with the encoding. I hope someone can give me advice on how to handle this problem. I'm currently using Python 2.7.

I'm loading an Excel file, and it loads correctly.

I'm using the following code and get the following output:

import os
import pandas as pd

germanStatesExcelFile = 'German_States.xlsx'
# Build the path relative to this script's directory
ePath_german_states = os.path.join(os.path.dirname(__file__), germanStatesExcelFile)
german_states = pd.read_excel(ePath_german_states)
print("doc " + str(german_states))

Output:

                    states
0        baden-württemberg
1                   bayern
2                   hessen
3          rheinland-pfalz
4                 saarland
5      nordrhein-westfalen

The next step is converting this DataFrame into a list, which I do with the following code:

german_states = german_states['states'].tolist()

Output:

[u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']

The list does not seem to handle the UTF-8 characters correctly, so I tried the following step:

german_states = [x.encode('utf-8') for x in german_states]

Output:

['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']

I would like to get the following output:

['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']

Try decoding, i.e. `german_states['states'].str.decode('utf-8').tolist()` – Bharath M Shetty, Feb 17 '18 at 13:10
I get the following error message: TypeError: list indices must be integers, not str – yellow days, Feb 17 '18 at 15:38
Use it right after your read_excel call, not after converting to a list. And to be on the safe side, use `...astype('str').str.decode(...` if you have missing values such as NaN – Bharath M Shetty, Feb 17 '18 at 16:35

score 1 · Answer 1 · answered Mar 19 '21 at 15:42

A little late to the party, but if encoding to UTF-8 as shown below doesn't work, you could use the unicodedata.normalize function.

german_states_decoded = [x.encode('utf-8') for x in german_states]
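
The answer names unicodedata.normalize without showing it, so here is a minimal sketch of how it might be applied. The sample list and the variable names are assumptions for illustration, and the ASCII fallback is only one possible use; it drops the umlaut, so it is not the output the question asks for.

```python
import unicodedata

# Hypothetical sample: the first entry spells "ü" as "u" + combining diaeresis.
german_states = [u'baden-wu\u0308rttemberg', u'bayern', u'hessen']

# NFC composes "u" + U+0308 into the single code point U+00FC.
normalized = [unicodedata.normalize('NFC', s) for s in german_states]

# NFKD decomposes again; encoding to ASCII with 'ignore' then drops the
# combining mark, leaving a plain ASCII approximation of each name.
ascii_only = [
    unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    for s in normalized
]
```

The u'' prefixes keep the sketch valid on both Python 2.7 and Python 3.3+.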

Thomas Callahan


jpp · Answer 2 · 2018-02-17T13:29:34.447

0

If your strings only contain ASCII characters, you could try Python's built-in str, as below. This works with the strings you provided, but it may not work in every case.

Otherwise, there are a number of good answers to a similar question.

german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']

german_states = list(map(str, german_states))

# ['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
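
As a side note on why the list output in the question shows `\xfc`: printing a whole list displays the repr of each element, and on Python 2.7 the repr of a unicode string escapes non-ASCII characters. The sketch below, which assumes a shortened sample list, suggests that the data itself is already correct and that only the display differs.

```python
german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen']

first = german_states[0]

# "\xfc" in the repr stands for one code point, U+00FC, not four characters.
char_count = len(first)

# Encoding to UTF-8 yields bytes, where U+00FC takes two bytes (0xC3 0xBC).
utf8_bytes = first.encode('utf-8')
byte_count = len(utf8_bytes)

# Printing a string (not the list) writes the characters themselves,
# assuming the terminal encoding can represent them.
joined = u', '.join(german_states)
print(joined)
```

Under that assumption, joining or printing the elements one by one is a way to see the umlaut without changing the list contents.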

edited Feb 17 '18 at 13:29

answered Feb 17 '18 at 13:12


I tried your code, but I still get this output: ['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen'] – yellow days Feb 17 '18 at 15:36
This must be an environment issue (e.g. Windows dependent). I suggest you look at the answer on the link I provided. It's a bit more complex, but should work. – jpp Feb 17 '18 at 15:39
