1

I have a df that contains id, building_id, built_year, registered_year and some other columns.

In the df, an id can contain multiple building_id. However, the data is not clean. There can be multiple building_id but different built_year and registered_year. I assume that the building got renovated, but rather than updating the data, it's been added.

So the plan for me is to rank the buildings based on the order of built_year, registered_year.

built_year is in the format 1999 and registered_year is in the format 1999-10-10 10:10:10+00:00

id building_id built_year registered_year
1 90 1999 1999-10-10 10:10:10+00:00
1 89 1925 1925-10-10 10:10:10+00:00
1 90 2001 2000-10-10 10:10:10+00:00
1 90 2009 2000-10-10 10:10:10+00:00

This is what I have:

df['built_year'] = pd.to_numeric(df['built_year'], errors='coerce')
df['registered_year'] = pd.to_datetime(df['registered_year'], errors='coerce')

df['rank'] = df.groupby('building_id')[['built_year', 'registered_year']] \
               .transform(lambda x: x.rank(method = 'dense', ascending = False))

id building_id built_year registered_year rank
1 90 1999 1999-10-10 10:10:10+00:00 3
1 89 1925 1925-10-10 10:10:10+00:00 1
1 90 2001 2000-10-10 10:10:10+00:00 2
1 90 2009 2000-10-10 10:10:10+00:00 1

Well, the code would work if I had only ranked it over one column, for instance, on only built_year. But by adding more columns, it gives me the following:

ValueError: Cannot set a DataFrame with multiple columns to the single column rank

Any help would be appreciated!

Corralien
  • 109,409
  • 8
  • 28
  • 52
hjun
  • 73
  • 7
  • 3
    Would you mind providing a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) so it's possible to investigate what's going on? – Simon David May 22 '23 at 21:09
  • I've never used `df.rank`, but it sounds like you just need to match the number of columns, like `df[['rank_built', 'rank_registered']] = ...`. If you provided a [mre], I could at least check if that works. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea May 22 '23 at 21:14
  • Sorting, followed by grouping and cumulatively counting (in descending order) should return the same desired result: `df["rank"] = df.sort_values(by=["built_year", "registered_year"]).groupby().cumcount(ascending=False) + 1` – Rawson May 22 '23 at 21:19

1 Answers1

1

You can simply use sort_values in this case:

df['rank'] = (df.sort_values(['built_year', 'registered_year'], ascending=False)
                .groupby('building_id').cumcount().add(1))

Output:

>>> df
   id  building_id  built_year           registered_year  rank
0   1           90        1999 1999-10-10 10:10:10+00:00     3
1   1           89        1925 1925-10-10 10:10:10+00:00     1
2   1           90        2001 2000-10-10 10:10:10+00:00     2
3   1           90        2009 2000-10-10 10:10:10+00:00     1

If you use rank, all columns will be ranked individually:

>>> (df.groupby('building_id')[['built_year', 'registered_year']]
       .transform('rank', method='dense', ascending=False))

   built_year  registered_year
0         3.0              2.0
1         1.0              1.0
2         2.0              1.0
3         1.0              1.0
Corralien
  • 109,409
  • 8
  • 28
  • 52