2

I often need to work with dataframes where the rownames (index in pandas) are like of the type [a-z][0-9]+ (e.g. 'a20','a21',... or 'b1','b11','b2',...)

I would like to be able to sort the rows based on the number alone. From what I can gather of the API, df.sort_index() does not take a function argument where I can define the sorting function.

Is there a neat way of achieving this without resorting to removing the initial characters when creating the dataframe?

posdef
  • 6,498
  • 11
  • 46
  • 94
  • Perhaps you can use reindex. – Bharath M Shetty Feb 05 '18 at 16:18
  • 1
    i think the current best approach is to construct a column from the index with the letter prefixes removed, and sort by that. – Haleemur Ali Feb 05 '18 at 16:21
  • I feel like the best theoretical way to do this is to have a multi-index of one string column and one integer column, then you can use standard pandas sorting and features. However, one of the current answers might very well be a better practical answer since multi-indexes introduce more complexity. – JohnE Feb 05 '18 at 16:48
  • I agree with @HaleemurAli, but only if the numeric part of the index is a unique identifier. I.e. how do you want to deal with A25 & B25 (if that is possible). – JohnE Feb 05 '18 at 16:56

3 Answers3

3

You can use extract, convert to int and get position of sorted values by argsort, last use iloc:

(borrowing data from @Dark):

df = df.iloc[df.index.str.extract('(\d+)', expand=False).astype(int).argsort()]
print (df)
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4

Explanation:

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])
print (df)
     0
a12  1
a10  2
a11  3
a15  4
a5   5
a6   6
b11  7
b12  8

First extract all integers:

print (df.index.str.extract('(\d+)', expand=False))
Index(['12', '10', '11', '15', '5', '6', '11', '12'], dtype='object')

Cast to integers:

print (df.index.str.extract('(\d+)', expand=False).astype(int))
Int64Index([12, 10, 11, 15, 5, 6, 11, 12], dtype='int64')

Get positions of sorted values by argsort:

print (df.index.str.extract('(\d+)', expand=False).astype(int).argsort())
[4 5 1 2 6 0 7 3]

Last select values by positions by iloc:

print (df.iloc[df.index.str.extract('(\d+)', expand=False).astype(int).argsort()])
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Could you elaborate a bit on what this does? It's cool that it seems to work but I'd like to be able to understand what it does :) – posdef Feb 06 '18 at 10:10
2

By using natsort sort by index list then reindex

df=pd.DataFrame({'v':[1,2,3]},index=['b1','b11','b2'])

from natsort import natsorted, ns


df.reindex(natsorted(df.index.tolist(), alg=ns.IGNORECASE))

Out[795]: 
     v
b1   1
b2   3
b11  2
BENY
  • 317,841
  • 20
  • 164
  • 234
2

You can use sorted and reindex i.e

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])

import re
df.reindex(sorted(df.index,key=lambda x: re.sub('[A-z]','',x)))


    0
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4
a5   5
a6   6
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108