In Python, how do I sort a dataframe containing accented characters?

Question

I use sort_values to sort a dataframe. The dataframe contains UTF-8 characters with accents. Here is an example:

>>> import pandas as pd
>>> df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])
>>> df.sort_values(by=[0])
   0
2  a
1  e
0  i
3  é

As you can see, the "é" with an accent is at the end instead of being after the "e" without accent.

Note that the real dataframe has several columns!

I would recommend stripping diacritics for the sorting, then re-adding them. — user3483203, May 07 '18 at 15:34

jpp · Answer 1 · 2018-05-07T15:57:22.460

5

The simplest solution, as suggested by @JonClements, is to sort on the NFKD-normalized strings:

df = df.iloc[df[0].str.normalize('NFKD').argsort()]
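To see why the one-liner above works, here is a minimal standard-library sketch (no pandas needed). It assumes the accented letter is stored as the single precomposed code point U+00E9; the helper name `nfkd` is only for illustration. NFKD decomposes "é" into a plain "e" plus a combining accent, so the comparison reaches the base letter first.

```python
import unicodedata

E_ACUTE = '\u00e9'  # precomposed 'é'


def nfkd(text):
    """Illustrative helper: return the NFKD-decomposed form of text."""
    return unicodedata.normalize('NFKD', text)


# NFKD splits 'é' into 'e' (U+0065) followed by a combining acute accent (U+0301).
code_points = [hex(ord(ch)) for ch in nfkd(E_ACUTE)]
print(code_points)  # ['0x65', '0x301']

words = ['i', 'e', 'a', E_ACUTE]
plain_order = sorted(words)           # code point order: 'é' (233) comes after 'i' (105)
nfkd_order = sorted(words, key=nfkd)  # 'e' + accent sorts right after 'e'
print(plain_order)  # ['a', 'e', 'i', 'é']
print(nfkd_order)   # ['a', 'e', 'é', 'i']

# Caveat: the combining accent compares greater than any ASCII letter,
# so in longer strings 'éa' ends up after 'ez'.
longer = sorted([E_ACUTE + 'a', 'ez'], key=nfkd)
print(longer)  # ['ez', 'éa']
```

The last two lines show the limit of this approach: it is fine for single letters, but for whole words the accent can outweigh the letters that follow it.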

An alternative, more long-winded solution (normalization code courtesy of @EdChum) strips the accents into a helper column, sorts on that column, then drops it:

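import pandas as pd

df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])

# pre-sort on the decomposed form so that 'e' precedes 'é'
df = df.iloc[df[0].str.normalize('NFKD').argsort()]

# remove accents: decompose, then drop the non-ASCII combining marks
df[1] = df[0].str.normalize('NFKD')\
             .str.encode('ascii', errors='ignore')\
             .str.decode('utf-8')

# sort by new column (a stable sort keeps the pre-sorted tie order), then drop
df = df.sort_values(1, ascending=True, kind='mergesort')\
       .drop(1, axis=1)

print(df)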

   0
2  a
1  e
3  é
0  i
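Since the question mentions a dataframe with several columns, the same idea can be written without a helper column. This is only a sketch, assuming pandas 1.1 or later (where `sort_values` accepts a `key` callable); the column names `name` and `value` and the helper `strip_accents` are made up for illustration.

```python
import pandas as pd


def strip_accents(col: pd.Series) -> pd.Series:
    """Hypothetical helper: decompose, then drop the combining marks,
    so that 'é' compares as 'e'."""
    return (col.str.normalize('NFKD')
               .str.encode('ascii', errors='ignore')
               .str.decode('ascii'))


# Illustrative data: one text column to sort on plus an unrelated column.
df = pd.DataFrame({'name': ['i', 'e', 'a', '\u00e9'],
                   'value': [1, 2, 3, 4]})

# key= is applied to the 'name' column only; the other columns follow the rows.
# kind='mergesort' is stable, so rows with equal keys keep their original order.
result = df.sort_values(by='name', key=strip_accents, kind='mergesort')
print(result)
```

Because "e" and "é" get the same key here, their relative order is decided only by where they appear in the input, which is the concern raised in the comments.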

edited May 07 '18 at 15:57

answered May 07 '18 at 15:40

jpp


In this case - might be that: `df.iloc[df[0].str.normalize('NFKD').argsort()]` is all that's needed... – Jon Clements May 07 '18 at 15:47
Do we need the encoding into ASCII? As soon as we do that, we no longer have all the és after all the es. – DSM May 07 '18 at 15:48
@DSM, That's a good point, but which *should* come first, `e` or `é`? – jpp May 07 '18 at 15:52
*As you can see, the "é" with an accent is at the end instead of being after the "e" without accent.* - seems to imply it's preferred that it's after the plain ASCII character... but why that should really *matter* at the end of the day instead of just being "with it" is another question... – Jon Clements May 07 '18 at 15:54
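The tie-breaking question discussed in these comments can be made explicit with a two-part sort key. The sketch below is not from the answer: it uses only the standard library, and the helper name `accent_key` is hypothetical. It assumes the desired order is "group by base letters first, then put the unaccented form before the accented one".

```python
import unicodedata


def accent_key(text):
    """Hypothetical key: (letters without combining marks, decomposed text).

    The first element groups 'é' with 'e'; the second breaks ties so that
    the unaccented spelling sorts first.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    base = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)


E_ACUTE = '\u00e9'  # precomposed 'é'
words = [E_ACUTE + 'a', 'ez', 'e', E_ACUTE, 'ea']

ordered = sorted(words, key=accent_key)
print(ordered)  # ['e', 'é', 'ea', 'éa', 'ez']
```

With this key the result does not depend on the input order, and "éa" stays next to "ea" instead of jumping past "ez".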
