3

I have a pandas multiIndex dataframe that I want to order by name and number.

A similar dataset, df is created as example:

random= np.random.rand(3,10)
a = [ 'a','b','a','b','a','b','a','b','a', 'b']
b = ['p11_1','p11_1','p1_1','p1_1','p9_1','p9_1','p10_1','p10_1','p1_2','p1_2']
arrays = [a,b]
df = pd.DataFrame(data = random, columns = arrays)

Header layers look like:

   a         b         a  ...         b         a         b
  p11_1     p11_1      p1_1  ...     p10_1      p1_2      p1_2

I can use the a simple sort command but that results in the wrong output:

df = df.sort_index(axis=1)

Header layers look like:

      a                      ...         b                    
  p10_1     p11_1      p1_1  ...      p1_1      p1_2      p9_1

This orders both layers, but the second layer is not the desired output.

The desired output orders the second layer as follows:

p1_1 < p1_2 < p9_1 < p10_1 < p11_1  

and not as:

p10_1 < p11_1 < p1_1 < p1_2 < p9_1 

Anything would help!

Herwini
  • 371
  • 1
  • 19

3 Answers3

3

Let us try natsorted

from natsort import natsorted
df=df.reindex(columns=natsorted(df.columns.tolist(), key=lambda element: (element[0], element[1])))
Out[126]: 
          a                      ...         b                    
       p1_1      p1_2      p9_1  ...      p9_1     p10_1     p11_1
0  0.122500  0.339663  0.880657  ...  0.258351  0.777972  0.824912
1  0.506081  0.947983  0.502101  ...  0.626606  0.187132  0.344037
2  0.465429  0.688159  0.396115  ...  0.840099  0.750303  0.932954
[3 rows x 10 columns]
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks!, but can you tell me how exactly this works? – Herwini Jul 29 '20 at 20:44
  • 1
    @ezrabloemendaal check what natsorted is https://stackoverflow.com/questions/29580978/naturally-sorting-pandas-dataframe, it will sort the string and consider the number and string mix , after that you can use it like normal sorted ~ – BENY Jul 29 '20 at 20:46
  • The other MultiIndex sorting methods seem not to work with integers or float. This worked great! – AhabTheArab Sep 07 '22 at 08:39
2

You can use regex to extract the digits, convert to int and sort:

import re
new_cols = sorted(df.columns, key=lambda x: (x[0],)+tuple(map(int, re.findall('(\d+)', x[1])) ))

df = df[new_cols]

Output:

          a                                                 b            \
       p1_1      p1_2      p9_1     p10_1     p11_1      p1_1      p1_2   
0  0.573945  0.648582  0.018916  0.349072  0.242901  0.672357  0.264832   
1  0.362348  0.046961  0.245218  0.971988  0.337589  0.868794  0.357096   
2  0.653817  0.882649  0.103931  0.778984  0.777559  0.056299  0.252898   

                                 
       p9_1     p10_1     p11_1  
0  0.645881  0.633993  0.340431  
1  0.908928  0.527769  0.862013  
2  0.302780  0.497609  0.378155  
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
2

Considering you need to naturally sort the second level, you can call numpy.lexsort:

get_level = df.columns.get_level_values
idx = np.lexsort((
    get_level(1).str.extract(f'p(\d+)', expand=False).astype(int), get_level(0)))

df.iloc[:, idx]     

          a                      ...         b                    
       p1_1      p1_2      p9_1  ...      p9_1     p10_1     p11_1
0  0.879848  0.384629  0.006705  ...  0.296052  0.716751  0.790975
1  0.139579  0.158237  0.737015  ...  0.702624  0.356452  0.557185
2  0.335480  0.133805  0.040322  ...  0.161040  0.622088  0.219986
cs95
  • 379,657
  • 97
  • 704
  • 746