4

How can I rank a DataFrame based on 2 columns?

On below example, col_b would the tie breaker for col_a.

DataFrame:

df = pd.DataFrame({'col_a':[0,0,0,1,1,1], 'col_b':[5,2,8,3,7,4]})

df
   col_a  col_b
0      0      5
1      0      2
2      0      8
3      1      3
4      1      7
5      1      4

Expected Output:

   col_a  col_b  Rank
0      0      5   2
1      0      2   1
2      0      8   3
3      1      3   4
4      1      7   6
5      1      4   5
Mazdak
  • 105,000
  • 18
  • 159
  • 188
Hugo
  • 147
  • 1
  • 7
  • You could do something similar to https://stackoverflow.com/questions/17604665/python-pandas-rank-by-column with an intermediate grouping. i.e. `col_a_Rank` and group on that and rank `col_b` to get final `Rank` value. – dub stylee Mar 16 '18 at 18:33

4 Answers4

4

Here is a one-line approach using sort_values:

In [135]: df['rank'] = df.sort_values(['col_a', 'col_b'])['col_b'].index + 1

In [136]: df
Out[136]: 
   col_a  col_b  rank
0      0      5     2
1      0      2     1
2      0      8     3
3      1      3     4
4      1      7     6
5      1      4     5

The logic behind this snippet: Basically, the DataFrame.sort_values function accepts multiple column names and returns a sorted copy of the dataframe based on the order of passed column names. The default sorting order is ascending which is what we want. If you wanted another order you could pass the order as an iterable of booleans to the ascending keyword argument. At the end the new indices of the column_b is what we want (plus one).

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • 1
    I was trying to avoid sorting. Not sure how it affects performance on a big DataFrame. But the solution works great. I will test it against the solution I gave to check which one is faster. – Hugo Mar 16 '18 at 19:19
  • Depending on your dataframe, you might need to reset the index first. Similar to this answer: https://stackoverflow.com/questions/49279404/how-do-i-create-a-rank-column-in-pandas – DataMan Apr 01 '19 at 17:48
3

Here is one way. Create a temp DataFrame by sorting the columns and re-indexing. Then use the new index as the rank and join back to the original df.

temp_df = df.sort_values(['col_a', 'col_b']).reset_index()
temp_df['rank'] = temp_df.index + 1
print(temp_df)
#   index  col_a  col_b  rank
#0      1      0      2     1
#1      0      0      5     2
#2      2      0      8     3
#3      3      1      3     4
#4      5      1      4     5
#5      4      1      7     6

The column 'index' corresponds to the index in the original DataFrame. Use this to join temp_df back to df and select the columns you want:

df = df.join(temp_df.set_index('index'), rsuffix="_r")[['col_a', 'col_b', 'rank']]
print(df)
#   col_a  col_b  rank
#0      0      5     2
#1      0      2     1
#2      0      8     3
#3      1      3     4
#4      1      7     6
#5      1      4     5
pault
  • 41,343
  • 15
  • 107
  • 149
2

Found my own solution: Create a tuple with the columns and rank it. Won't handle different ascending/descending order, but it is good for my problem.

df['rank'] = df[['col_a','col_b']].apply(tuple, 1).rank()
Hugo
  • 147
  • 1
  • 7
1

Using numpy's argsort method.

df = pd.DataFrame({'col_a':[0,0,0,1,1,1], 'col_b':[5,2,8,3,7,4]})
df["rank"] = np.argsort(
              df.values.copy().view(dtype=[('x', int), ('y', int)]).flatten(),  
              order=("x","y")
             ) + 1


    col_a   col_b   rank
0   0        5       2
1   0        2       1
2   0        8       3
3   1        3       4
4   1        7       6
5   1        4       5
Tai
  • 7,684
  • 3
  • 29
  • 49