pandas sort rows by index using a custom function

Question

I often need to work with dataframes where the rownames (index in pandas) are like of the type [a-z][0-9]+ (e.g. 'a20','a21',... or 'b1','b11','b2',...)

I would like to be able to sort the rows based on the number alone. From what I can gather of the API, df.sort_index() does not take a function argument where I can define the sorting function.

Is there a neat way of achieving this without resorting to removing the initial characters when creating the dataframe?

i think the current best approach is to construct a column from the index with the letter prefixes removed, and sort by that. — Haleemur Ali, Feb 05 '18 at 16:21
I feel like the best theoretical way to do this is to have a multi-index of one string column and one integer column, then you can use standard pandas sorting and features. However, one of the current answers might very well be a better practical answer since multi-indexes introduce more complexity. — JohnE, Feb 05 '18 at 16:48
I agree with @HaleemurAli, but only if the numeric part of the index is a unique identifier. I.e. how do you want to deal with A25 & B25 (if that is possible). — JohnE, Feb 05 '18 at 16:56

jezrael · Answer 1 · 2018-02-06T10:16:26.033

You can use extract, convert to int and get position of sorted values by argsort, last use iloc:

(borrowing data from @Dark):

df = df.iloc[df.index.str.extract('(\d+)', expand=False).astype(int).argsort()]
print (df)
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4

Explanation:

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])
print (df)
     0
a12  1
a10  2
a11  3
a15  4
a5   5
a6   6
b11  7
b12  8

First extract all integers:

print (df.index.str.extract('(\d+)', expand=False))
Index(['12', '10', '11', '15', '5', '6', '11', '12'], dtype='object')

Cast to integers:

print (df.index.str.extract('(\d+)', expand=False).astype(int))
Int64Index([12, 10, 11, 15, 5, 6, 11, 12], dtype='int64')

Get positions of sorted values by argsort:

print (df.index.str.extract('(\d+)', expand=False).astype(int).argsort())
[4 5 1 2 6 0 7 3]

Last select values by positions by iloc:

print (df.iloc[df.index.str.extract('(\d+)', expand=False).astype(int).argsort()])
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4

Could you elaborate a bit on what this does? It's cool that it seems to work but I'd like to be able to understand what it does :) — posdef, Feb 06 '18 at 10:10

score 2 · Answer 2 · answered Feb 05 '18 at 16:21

2

By using natsort sort by index list then reindex

df=pd.DataFrame({'v':[1,2,3]},index=['b1','b11','b2'])

from natsort import natsorted, ns


df.reindex(natsorted(df.index.tolist(), alg=ns.IGNORECASE))

Out[795]: 
     v
b1   1
b2   3
b11  2

answered Feb 05 '18 at 16:21

BENY

317,841
20
164
234

You were a bit quicker – Bharath M Shetty Feb 05 '18 at 16:22
@Dark yep , just a little bit :-) – BENY Feb 05 '18 at 16:24

score 2 · Answer 3 · answered Feb 05 '18 at 16:22

2

You can use sorted and reindex i.e

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])

import re
df.reindex(sorted(df.index,key=lambda x: re.sub('[A-z]','',x)))


    0
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4
a5   5
a6   6

answered Feb 05 '18 at 16:22

Bharath M Shetty

30,075
6
57
108

2

I think need convert to `int` :) – jezrael Feb 05 '18 at 16:48

pandas sort rows by index using a custom function

3 Answers3