Pandas rank by multiple columns

Question

I am trying to rank a pandas data frame based on two columns. I can rank it based on one column, but how can to rank it based on two columns? 'SaleCount', then 'TotalRevenue'?

import pandas as pd

df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
    'Date':['2016-12-02' for i in range(10)],
    'SaleCount':[10,100,30,35,20,100,0,30,2,20],
    'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})

df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)

#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)

print(df)

current output:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-06  100          2000            S8      1
3   2016-12-04  35           750             S5      2
2   2016-12-03  30           1000            S1      3
7   2016-12-08  30           600             S7      3
9   2016-12-10  20           500             S10     4
4   2016-12-05  20           500             S4      4
0   2016-12-01  10           300             S3      5
8   2016-12-09  2            50              S9      6
6   2016-12-07  0            0               S6      7

I'm trying to generate an output like this:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-02  100          2000            S8      2
3   2016-12-02  35           750             S5      3
2   2016-12-02  30           1000            S1      4
7   2016-12-02  30           600             S7      5
9   2016-12-02  20           500             S10     6
4   2016-12-02  20           500             S4      6
0   2016-12-02  10           300             S3      7
8   2016-12-02  2            50              S9      8
6   2016-12-02  0            0               S6      9

score 28 · Answer 1 · answered May 08 '19 at 16:30

28

The generic way to do that is to group the desired fiels in a tuple, whatever the types.

df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
             .rank(method='dense',ascending=False).astype(int)

df.sort_values("Rank")

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9

answered May 08 '19 at 16:30

B. M.

18,243
2
35
54

2

What if we would like to group by another column like shops in this case? – Eduardo EPF Oct 22 '19 at 20:24
@EduardoEPF Add it to the list, ie df[["SaleCount","TotalRevenue", "shops"]].apply(tuple,axis=1)\ .rank(method='dense',ascending=False).astype(int) – Ion Freeman Nov 18 '21 at 22:27
1

@EduardoEPF Adding the tuple to df before grouping worked for me: `df.assign(_rankby=df[['SaleCount', 'TotalRevenue']].apply(tuple, axis=1)).groupby('shops')._rankby.rank(method='dense', ascending=False).astype(int)`. – Vic Apr 06 '22 at 22:31

score 13 · Answer 2 · edited Jan 22 '20 at 15:25

pd.factorize will generate unique values for each unique element of a iterable. We only need to sort in the order we'd like, then factorize. In order to do multiple columns, we convert the sorted result to tuples.

cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)

df.assign(Rank=factorized)

         Date  SaleCount  TotalRevenue shops  Rank
1  2016-12-02        100          9000    S2     1
5  2016-12-02        100          2000    S8     2
3  2016-12-02         35           750    S5     3
2  2016-12-02         30          1000    S1     4
7  2016-12-02         30           600    S7     5
4  2016-12-02         20           500    S4     6
9  2016-12-02         20           500   S10     6
0  2016-12-02         10           300    S3     7
8  2016-12-02          2            50    S9     8
6  2016-12-02          0             0    S6     9

Nickil Maveli · Accepted Answer · 2017-02-01T08:28:53.710

8

Another way would be to type-cast both the columns of interest to str and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.

In method=dense, ranks of duplicated values would remain unchanged. (Here: 6)

Since you want to rank these in their descending order, specifying ascending=False in Series.rank() would let you achieve the desired result.

col1 = df["SaleCount"].astype(str) 
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')

edited Feb 01 '17 at 08:28

answered Feb 01 '17 at 08:25

Nickil Maveli

29,155
8
82
85

Yes, it is. Else, `rank` wouldn't be able to assign groups based on their magnitude. – Nickil Maveli Feb 01 '17 at 08:30
@piRSquared: Thanks a lot :-) – Nickil Maveli Feb 01 '17 at 08:36
2

Note it would be dangerous to string-concatenate the two columns `(col1+col2` *without* casting back `astype(int)` which would be dangerous, as '30' > '100' in string-land sort order. – smci May 13 '18 at 10:25
1

This might be syntactically right, but if you see the ranking is still based on TotalRevenue column effectively. This is because of both numbers are in different scale, so their sum is mostly influenced by TotalRevenue and not much by SaleCount. Maybe normalizing them before adding up should help. – Manish Oct 10 '18 at 12:14
1

This solution is basically ranking as per totalrevenue – Eduardo EPF Oct 22 '19 at 19:12
1

this is wrong: e.g. change 'TotalRevenue' of the 3rd row from 1000 to 100000 it will pop up to 1st place which is undesired – dingx May 10 '20 at 15:25
Why did OP select this solution when they asked for an ultimate idx order of 1,5,3,2,7… and this yields 1,5,2,3,7…? @NickilMaveli, this appears plain wrong, and has been downvoted 5 times. What do you think about deleting this? – Zach Young Mar 11 '23 at 18:13

score 6 · Answer 4 · answered Oct 22 '19 at 21:44

`sort_values` + `GroupBy.ngroup`

This will give the dense ranking.

Columns should be sorted in the desired order prior to the groupby. Specifying sort=False within the groupby then respects this sorting so that groups are labeled in the order they appear within the sorted DataFrame.

cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1

Output:

print(df.sort_values('Rank'))

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9

score 1 · Answer 5 · answered May 13 '18 at 10:47

(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)

However here's a shortcut if you know that TotalRevenue is constrained to some range e.g. 0 to MAX_REVENUE=100,000 ; directly manipulate them as nonnegative integers:

df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)

df.sort_values('Rank2')

score 0 · Answer 6 · answered Jun 03 '22 at 12:46

This function will rank successively by a list of columns and supports ranking with groups (something that cannot be done if you just order all rows by multiple columns).

def rank_multicol(
    df: pd.DataFrame,
    rank_by: List[str],
    group_by: Optional[List[str]] = None,
    ascending: Union[List[bool], bool] = True,
    rank_col_name: str = 'rank',
) - > pd.DataFrame:
    df_aux = df.copy()
    columns_to_group_by = [] if group_by is None else group_by
    if type(ascending) is bool:
        ascending = [ascending for _ in range(len(rank_by))]
    elif len(ascending) != len(rank_by):
        raise ValueError("`ascending` must be a scalar or have the same length of `rank_by`.")

    for idx, feature in enumerate(rank_by):
        # TODO: Optimize if no untying is required
        if columns_to_group_by:
            df_to_rank = df_aux.groupby(columns_to_group_by)
        else:
            df_to_rank = df_aux.copy()
        ranks = (
            df_to_rank
            [feature]
            .rank(ascending=ascending[idx], method='min')
            .rename(rank_col_name)
        )
        if rank_col_name in df_aux:
            df_aux[rank_col_name] = ranks + (df_aux[rank_col_name] - 1)
        else:
            df_aux[rank_col_name] = ranks

        columns_to_group_by.append(feature)
    return df_aux

Pandas rank by multiple columns

6 Answers6

`sort_values` + `GroupBy.ngroup`

Linked

Related

Pandas rank by multiple columns

6 Answers6

sort_values + GroupBy.ngroup

Linked

Related

`sort_values` + `GroupBy.ngroup`