Pandas group by dense rank partitioned by multiple columns

Question

I have a df that contains id, building_id, built_year, registered_year and some other columns.

In the df, an id can contain multiple building_id. However, the data is not clean. There can be multiple building_id but different built_year and registered_year. I assume that the building got renovated, but rather than updating the data, it's been added.

So the plan for me is to rank the buildings based on the order of built_year, registered_year.

built_year is in the format 1999 and registered_year is in the format 1999-10-10 10:10:10+00:00

id	building_id	built_year	registered_year
1	90	1999	1999-10-10 10:10:10+00:00
1	89	1925	1925-10-10 10:10:10+00:00
1	90	2001	2000-10-10 10:10:10+00:00
1	90	2009	2000-10-10 10:10:10+00:00

This is what I have:

df['built_year'] = pd.to_numeric(df['built_year'], errors='coerce')
df['registered_year'] = pd.to_datetime(df['registered_year'], errors='coerce')

df['rank'] = df.groupby('building_id')[['built_year', 'registered_year']] \
               .transform(lambda x: x.rank(method = 'dense', ascending = False))

id	building_id	built_year	registered_year	rank
1	90	1999	1999-10-10 10:10:10+00:00	3
1	89	1925	1925-10-10 10:10:10+00:00	1
1	90	2001	2000-10-10 10:10:10+00:00	2
1	90	2009	2000-10-10 10:10:10+00:00	1

Well, the code would work if I had only ranked it over one column, for instance, on only built_year. But by adding more columns, it gives me the following:

ValueError: Cannot set a DataFrame with multiple columns to the single column rank

Any help would be appreciated!

Would you mind providing a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) so it's possible to investigate what's going on? — Simon David, May 22 '23 at 21:09
I've never used `df.rank`, but it sounds like you just need to match the number of columns, like `df[['rank_built', 'rank_registered']] = ...`. If you provided a [mre], I could at least check if that works. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). — wjandrea, May 22 '23 at 21:14
Sorting, followed by grouping and cumulatively counting (in descending order) should return the same desired result: `df["rank"] = df.sort_values(by=["built_year", "registered_year"]).groupby().cumcount(ascending=False) + 1` — Rawson, May 22 '23 at 21:19

Corralien · Accepted Answer · 2023-05-22T21:54:00.113

You can simply use sort_values in this case:

df['rank'] = (df.sort_values(['built_year', 'registered_year'], ascending=False)
                .groupby('building_id').cumcount().add(1))

Output:

>>> df
   id  building_id  built_year           registered_year  rank
0   1           90        1999 1999-10-10 10:10:10+00:00     3
1   1           89        1925 1925-10-10 10:10:10+00:00     1
2   1           90        2001 2000-10-10 10:10:10+00:00     2
3   1           90        2009 2000-10-10 10:10:10+00:00     1

If you use rank, all columns will be ranked individually:

>>> (df.groupby('building_id')[['built_year', 'registered_year']]
       .transform('rank', method='dense', ascending=False))

   built_year  registered_year
0         3.0              2.0
1         1.0              1.0
2         2.0              1.0
3         1.0              1.0

Pandas group by dense rank partitioned by multiple columns

1 Answers1