1

I'm trying to convert a pandas DataFrame into a list. The conversion works, but I have some issues with the encoding. I hope someone can advise me on how to handle this problem. I'm using Python 2.7.

I'm loading an Excel file, and it loads correctly.

I'm using the following code and get the following output:

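import os
import pandas as pd

germanStatesExcelFile = 'German_States.xlsx'
# Build the path relative to the directory containing this script.
ePath_german_states = os.path.join(os.path.dirname(__file__), germanStatesExcelFile)
german_states = pd.read_excel(ePath_german_states)
print("doc " + str(german_states))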

Output:

                    states
0        baden-württemberg
1                   bayern
2                   hessen
3          rheinland-pfalz
4                 saarland
5      nordrhein-westfalen

The next step is converting this DataFrame into a list, which I do with the following code:

german_states = german_states['states'].tolist()

Output:

[u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']

It seems the list is not handling UTF-8 correctly, so I tried the following step:

german_states = [x.encode('utf-8') for x in german_states]

Output:

['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']

I would like to have the following output:

['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
Ronan Boiteau
yellow days

2 Answers

1

A little late to the party, but if encoding to UTF-8 as below doesn't work, you could use the unicodedata.normalize function from the standard unicodedata module.

german_states_decoded = [x.encode('utf-8') for x in german_states]
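
For the unicodedata.normalize route, here is a minimal sketch of one way it could be used. The helper name strip_accents is made up for illustration, and note the trade-off: this approach removes the umlaut instead of preserving it, so it only fits cases where plain ASCII text is acceptable.

```python
import unicodedata

german_states = [u'baden-w\xfcrttemberg', u'bayern']

def strip_accents(text):
    # Hypothetical helper, not part of the original answer.
    # NFKD splits u'\xfc' into 'u' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping the combining marks leaves only the base letters.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

ascii_states = [strip_accents(s) for s in german_states]
```

With the input above, the first entry comes out as 'baden-wurttemberg', which is probably not what you want if the spelling has to stay intact.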
0

If your strings contain only ASCII characters, you could try Python's built-in str, as below. This worked with the strings you provided when I ran it, but that may not hold in every environment.

Otherwise, there are a number of good answers to a similar question.

german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']

german_states = list(map(str, german_states))

# ['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
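
A caveat, sketched under the assumption of a default Python 2 setup where str() on a unicode object encodes implicitly with ASCII: the "ü" in the first entry is not ASCII, so that implicit step can fail there. The snippet below reproduces the same ASCII encode explicitly, then shows that an explicit UTF-8 encode keeps the data intact. The variable names are only for illustration.

```python
german_states = [u'baden-w\xfcrttemberg', u'bayern']

# Encoding to ASCII fails for u'\xfc'. Python 2's str() on a unicode
# object attempts this same ASCII encoding implicitly by default.
try:
    german_states[0].encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encode succeeds; the umlaut becomes the bytes C3 BC.
utf8_states = [s.encode('utf-8') for s in german_states]

# Decoding the bytes returns the original text, so nothing was lost.
round_trip = [b.decode('utf-8') for b in utf8_states]
```

The \xfc and \xc3\xbc escapes appear because printing a whole list shows the repr of each element. Printing the elements one at a time (for example, for s in german_states: print(s)) should display the umlaut on a terminal that supports it.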
jpp
  • I tried your code, but I still get this output: ['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen'] – yellow days Feb 17 '18 at 15:36
  • Must be an environment issue (e.g. Windows-dependent). I suggest you look at the answer at the link I provided. It's a bit more complex, but should work. – jpp Feb 17 '18 at 15:39