2

Intro and problem

How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?

I know how to calculate aggregated group-level statistics using the groupby() method and I also know how to rank using multiple columns without groups (see here, here and here). The main problem seems to be getting both ideas (grouping & ranking) to play nicely together.

This other thread has some ideas on how to solve the problem, but its results don't show you which rows are tied - it just returns an array of ever-increasing ranks even when the values are identical. The problem is described in more detail in the example I created below.

Minimal reproducible example

import pandas as pd

df = pd.DataFrame({'row_id':[1,2,3,4,5,6,7,8,9,10],
                   'Group':[1,1,1,1,1,2,2,2,2,2],
                   'Var1':[100,100,100,200,200,300,300,400,400,400],
                   'Var2':[5,5,6,7,8,1,1,2,2,3]})
print(df)

#    row_id  Group  Var1  Var2
# 0       1      1   100     5
# 1       2      1   100     5
# 2       3      1   100     6
# 3       4      1   200     7
# 4       5      1   200     8
# 5       6      2   300     1
# 6       7      2   300     1
# 7       8      2   400     2
# 8       9      2   400     2
# 9      10      2   400     3

In the case above, I would like to group using the Group variable and rank using the Var1 and Var2 variables. Therefore, I expect the output to look like this:

#    row_id  Group  Var1  Var2  Rank
# 0       1      1   100     5     1
# 1       2      1   100     5     1
# 2       3      1   100     6     3
# 3       4      1   200     7     4
# 4       5      1   200     8     5
# 5       6      2   300     1     1
# 6       7      2   300     1     1
# 7       8      2   400     2     3
# 8       9      2   400     2     3
# 9      10      2   400     3     5

What I've tried

Using the data in the example above, if I would like to group using the Group variable and only rank based on the Var1 column, that would be pretty easy:

df['Rank_Only_Var1'] = df.groupby('Group')['Var1'].rank(method='min', ascending=True)

print(df)

#    row_id  Group  Var1  Var2  Rank_Only_Var1
# 0       1      1   100     5             1.0
# 1       2      1   100     5             1.0
# 2       3      1   100     6             1.0
# 3       4      1   200     7             4.0
# 4       5      1   200     8             4.0
# 5       6      2   300     1             1.0
# 6       7      2   300     1             1.0
# 7       8      2   400     2             3.0
# 8       9      2   400     2             3.0
# 9      10      2   400     3             3.0

However, if I want to group using the Group variable and rank using the Var1 and Var2 variables, things get hairy. Using the approach suggested by this other post, we arrive at the following results:

df = df.sort_values(['Var1', 'Var1'], ascending=[True, True])
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['Group'])['overall_rank'].cumsum()

print(df)

#    row_id  Group  Var1  Var2  overall_rank
# 0       1      1   100     5             1
# 1       2      1   100     5             2
# 2       3      1   100     6             3
# 3       4      1   200     7             4
# 4       5      1   200     8             5
# 5       6      2   300     1             1
# 6       7      2   300     1             2
# 7       8      2   400     2             3
# 8       9      2   400     2             4
# 9      10      2   400     3             5

Note how the first and second rows have identical values for Var1 and Var2, but the first row is ranked 1 and the second row is ranked 2. Those two rows shouldn't have different ranks. Their ranks should be identical and tied, because the values the rank is based on are identical and tied. This problem also happens with rows 6 & 7 as well as with rows 8 & 9.

I even tried adapting the solution from this answer, but it doesn't work when we have a groupby statement.

Back to the heart of the question

How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?

Felipe D.
  • 1,157
  • 9
  • 19

1 Answers1

1

It's not clear why you can't use the linked solution with a .groupby

You can also replace the .apply for better performance:

(df.assign(key = list(map(tuple, df[['Var1', 'Var2']].values)))
   .groupby('Group')['key'].rank('dense'))
0    1.0
1    1.0
2    2.0
3    3.0
4    4.0
5    1.0
6    1.0
7    2.0
8    2.0
9    3.0
Name: key, dtype: float64
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • Awesome, thank you so much! I was getting the order of operations wrong: I tried using the `groupby` **before** the `assign`/`apply` commands. This works perfectly! Thanks again! – Felipe D. Apr 07 '23 at 01:20
  • Do you happen to know how to control the order in which the ranking is done for each variable? For example, say we want to rank `Var1` in ascending order but `Var2` in descending order. Is that possible? – Felipe D. Apr 07 '23 at 01:31
  • The only trick I can think of is to figure out which columns you want to be treated in the inverse order and just multiply them by -1 using an `assign` statement`. So in this case, if we want to do ascending by `Var1` and descending by `Var2`, we could do the following: ```(df.assign(Var2_Inv = -df['Var2'], key = list(map(tuple, df[['Var1', 'Var2_Inv']].values))) .groupby('Group')['key'] .rank('dense', ascending=True))``` – Felipe D. Apr 07 '23 at 01:36
  • 1
    I'm not sure if there is a better way than doing what you've suggested. – jqurious Apr 07 '23 at 01:40