
I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'asset_id': [10, 10, 10, 20, 20, 20],
                   'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
                   'method_rank': [5, 2, 2, 2, 5, 1],
                   'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5]},
                  columns=['asset_id', 'method_id', 'method_rank', 'conf_score'])

It looks like this:

   asset_id method_id  method_rank  conf_score
0    10        p2          5         0.8
1    10        p3          2         0.6
2    10        p4          2         0.8
3    20        p3          2         0.9
4    20        p1          5         0.7
5    20        p2          1         0.5

I want to group the rows by asset_id, and then give each row an overall rank within its group, based on method_rank ascending and conf_score descending.

i.e. I want the result to look like this:

  asset_id method_id  method_rank  conf_score  overall_rank
5    20        p2         1           0.5          1.0
3    20        p3         2           0.9          2.0
2    10        p4         2           0.8          1.0
1    10        p3         2           0.6          2.0
0    10        p2         5           0.8          3.0
4    20        p1         5           0.7          3.0

How can I do this using groupby and ranking in pandas? It looks like pandas lets you rank only with a single sort direction, like

df["overall_rank"] = df.groupby('asset_id')['method_rank'].rank(method="first")

But I want to achieve something like

df["overall_rank"] = df.groupby('asset_id')[['method_rank', 'conf_score']].rank(method="first", ascending=[True, False])

How do I do this? I am aware that a hacky way is to first use sort_values on the entire dataframe and then do the groupby, but sorting the entire dataframe seems too expensive when I only want to sort a few rows within each group.

right_here

1 Answer


Method 1: sort the whole frame once, then take a cumulative sum within each group:

df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False], inplace=True)
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()

df

   asset_id method_id  method_rank  conf_score  overall_rank
2        10        p4            2         0.8             1
1        10        p3            2         0.6             2
0        10        p2            5         0.8             3
5        20        p2            1         0.5             1
3        20        p3            2         0.9             2
4        20        p1            5         0.7             3
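As a side note (not part of the original answer), the seed-column-plus-cumsum step can be replaced with groupby's cumcount; a minimal sketch on the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({'asset_id': [10, 10, 10, 20, 20, 20],
                   'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
                   'method_rank': [5, 2, 2, 2, 5, 1],
                   'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5]})

# Sort once by group key plus the two ordering columns, then number
# the rows within each asset_id group starting at 1
out = df.sort_values(['asset_id', 'method_rank', 'conf_score'],
                     ascending=[True, True, False])
out['overall_rank'] = out.groupby('asset_id').cumcount() + 1
```

This gives the same ordering and ranks without the intermediate column of ones.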

Method 2:

Define a function that sorts and numbers every group:

import numpy as np

def handle_group(group):
    group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
    group['overall_rank'] = np.arange(1, len(group)+1)
    return group

df.groupby('asset_id', as_index=False).apply(handle_group)
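Note that apply here returns a frame with an extra group level in the index (dropped later with droplevel). A sketch of the same approach using group_keys=False, which keeps the original flat index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'asset_id': [10, 10, 10, 20, 20, 20],
                   'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
                   'method_rank': [5, 2, 2, 2, 5, 1],
                   'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5]})

def handle_group(group):
    # Sort within the group, then number the rows 1..n
    group = group.sort_values(['method_rank', 'conf_score'],
                              ascending=[True, False])
    group['overall_rank'] = np.arange(1, len(group) + 1)
    return group

# group_keys=False leaves the original index intact, so no droplevel is needed
out = df.groupby('asset_id', group_keys=False).apply(handle_group)
```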

Performance test:

def run1(df):
    df = df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False])
    df['overall_rank'] = 1
    df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()    
    return df

def handle_group(group):
    group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
    group['overall_rank'] = np.arange(1, len(group)+1)
    return group

def run2(df):
    df = df.groupby('asset_id', as_index=False).apply(handle_group)
    return df

dfn = pd.concat([df]*10000, ignore_index=True)

%%timeit
df1 = run1(dfn)
# 8.61 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
df2 = run2(dfn).droplevel(0)
# 31.6 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
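One more sketch, addressing the worry about sorting the whole frame: if conf_score always lies in [0, 1) and method_rank is an integer (true for the sample data, but an assumption about your real data), the two keys can be folded into a single numeric key, so an ordinary single-column grouped rank works with no sorting at all:

```python
import pandas as pd

df = pd.DataFrame({'asset_id': [10, 10, 10, 20, 20, 20],
                   'method_id': ['p2', 'p3', 'p4', 'p3', 'p1', 'p2'],
                   'method_rank': [5, 2, 2, 2, 5, 1],
                   'conf_score': [0.8, 0.6, 0.8, 0.9, 0.7, 0.5]})

# Composite key: the integer part orders by method_rank ascending,
# the fractional part breaks ties by conf_score descending
key = df['method_rank'] - df['conf_score']
df['overall_rank'] = key.groupby(df['asset_id']).rank(method='first')
```

This only works because subtracting a fraction in [0, 1) can never reorder distinct integer method_rank values; with unbounded or negative scores you would need one of the sorting approaches above.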
Ferris