5

I have a dataframe where the index value is a mixture of string and number separated by an underscore.

    sub_int1_ICA_int2  # 

enter image description here

I would like to sort the column index using int1 first and after that int2 The expected output would be:

    sub_1_ICA_1
    sub_1_ICA_2
    sub_1_ICA_3
    ...........
    sub_2_ICA_1
    sub_2_ICA_2
    ...........

I tried to use convert_numeric as I saw in many posts, but I get an error

     X.convert_objects(convert_numeric=True).sort_values(['id] , ascending=[True], inplace=True)
    >>(KeyError: 'id')

Any help would be nice!

Ben
  • 329
  • 1
  • 3
  • 17

1 Answers1

2

Use reindex by sorted list by custom function with dictionary of tuples:

print (df)
              a
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_2_ICA_1   1
sub_2_ICA_2   3


a = df.index.tolist()
b = {}
for x in a:
    i = x.split('_')
    b[x] = ((int(i[1]), int(i[-1])))
print (b)
{'sub_1_ICA_10': (1, 10), 'sub_1_ICA_11': (1, 11), 
'sub_1_ICA_1': (1, 1), 'sub_2_ICA_2': (2, 2),
 'sub_1_ICA_0': (1, 0), 'sub_1_ICA_12': (1, 12), 
 'sub_1_ICA_3': (1, 3), 'sub_1_ICA_2': (1, 2),
 'sub_2_ICA_1': (2, 1)}

c = sorted(a, key=lambda x: b[x])
print (c)
['sub_1_ICA_0', 'sub_1_ICA_1', 'sub_1_ICA_2', 'sub_1_ICA_3', 
'sub_1_ICA_10', 'sub_1_ICA_11', 'sub_1_ICA_12', 'sub_2_ICA_1', 'sub_2_ICA_2']

df = df.reindex(c)
print (df)
              a
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_2_ICA_1   1
sub_2_ICA_2   3

Another pure pandas solution:

#create MultiIndex by split index, convert to DataFrame
df1 = df.index.str.split('_', expand=True).to_frame()
#set columns and index to original df
df1.columns = list('abcd')
df1.index = df.index
#convert columns to int and sort
df1[['b','d']] = df1[['b','d']].astype(int)
df1 = df1.sort_values(['b','d'])
print (df1)
                a  b    c   d
sub_1_ICA_0   sub  1  ICA   0
sub_1_ICA_1   sub  1  ICA   1
sub_1_ICA_2   sub  1  ICA   2
sub_1_ICA_3   sub  1  ICA   3
sub_1_ICA_10  sub  1  ICA  10
sub_1_ICA_11  sub  1  ICA  11
sub_1_ICA_12  sub  1  ICA  12
sub_2_ICA_1   sub  2  ICA   1
sub_2_ICA_2   sub  2  ICA   2

df = df.reindex(df1.index)
print (df)
              a
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_2_ICA_1   1
sub_2_ICA_2   3

And last version with natsort:

from natsort import natsorted

df = df.reindex(natsorted(df.index))
print (df)
              a
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_2_ICA_1   1
sub_2_ICA_2   3

EDIT:

If duplicates values then create new columns by split, convert to int, sort and get back:

print (df)
              a
sub_1_ICA_0   4
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_2_ICA_1   1
sub_2_ICA_2   3

df.index = df.index.str.split('_', expand=True)
df = df.reset_index()
df[['level_1','level_3']] = df[['level_1','level_3']].astype(int)
df = df.sort_values(['level_1','level_3']).astype(str)

df = df.set_index(['level_0','level_1','level_2','level_3'])
df.index = df.index.map('_'.join)

print (df)

              a
sub_1_ICA_0   4
sub_1_ICA_0   4
sub_1_ICA_1   8
sub_1_ICA_2   6
sub_1_ICA_3   6
sub_1_ICA_10  7
sub_1_ICA_11  3
sub_1_ICA_12  2
sub_2_ICA_1   1
sub_2_ICA_2   3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Many thanks, jezrael! the first code worked perfectly in a dataframe which originally doesn't have the default index (0,1,2). But, when I use it in dataframe where I set the index to the column 'id' (using set_index) I got the error "cannot reindex from a duplicate axis". When I checked print(c), I can see that all index values are duplicated several time. Any ideas? – Ben Nov 14 '17 at 10:09
  • thanks, but the problem I am not supposed to have duplicated index! – Ben Nov 14 '17 at 10:28
  • Then problem is with data, check duplicated index values by `print (df[df.index.duplicated(keep=False)])` – jezrael Nov 14 '17 at 10:30
  • Yes! there are duplicated entries maybe because I construct the df by concat. Is there a way to remove the duplicated index? – Ben Nov 14 '17 at 10:39
  • The best is `df = df.reset_index(drop=True)` if data in index are not necessary. – jezrael Nov 14 '17 at 11:09
  • Are your data confidental? – jezrael Nov 14 '17 at 11:16
  • no it is not confidential. But, the problem is from a previous code loading csv files into one df [link](https://stackoverflow.com/questions/47258065/load-multiple-csv-files-into-dataframe-columns-names-issue/47258176?noredirect=1#comment81467473_47258176). – Ben Nov 14 '17 at 13:49
  • OK, I have idea - can you send me your files - 10 in zip to my email in my profile? Because without data hard find problem. – jezrael Nov 14 '17 at 13:50
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/158973/discussion-between-ben-and-jezrael). – Ben Nov 14 '17 at 13:55