I have a df that contains id, building_id, built_year, registered_year
and some other columns.
In the df, an id can contain multiple building_id
. However, the data is not clean. There can be multiple building_id
but different built_year
and registered_year
. I assume that the building got renovated, but rather than updating the data, it's been added.
So the plan for me is to rank the buildings based on the order of built_year
, registered_year
.
built_year
is in the format 1999 and registered_year
is in the format 1999-10-10 10:10:10+00:00
id | building_id | built_year | registered_year |
---|---|---|---|
1 | 90 | 1999 | 1999-10-10 10:10:10+00:00 |
1 | 89 | 1925 | 1925-10-10 10:10:10+00:00 |
1 | 90 | 2001 | 2000-10-10 10:10:10+00:00 |
1 | 90 | 2009 | 2000-10-10 10:10:10+00:00 |
This is what I have:
df['built_year'] = pd.to_numeric(df['built_year'], errors='coerce')
df['registered_year'] = pd.to_datetime(df['registered_year'], errors='coerce')
df['rank'] = df.groupby('building_id')[['built_year', 'registered_year']] \
.transform(lambda x: x.rank(method = 'dense', ascending = False))
id | building_id | built_year | registered_year | rank |
---|---|---|---|---|
1 | 90 | 1999 | 1999-10-10 10:10:10+00:00 | 3 |
1 | 89 | 1925 | 1925-10-10 10:10:10+00:00 | 1 |
1 | 90 | 2001 | 2000-10-10 10:10:10+00:00 | 2 |
1 | 90 | 2009 | 2000-10-10 10:10:10+00:00 | 1 |
Well, the code would work if I had only ranked it over one column, for instance, on only built_year
. But by adding more columns, it gives me the following:
ValueError: Cannot set a DataFrame with multiple columns to the single column rank
Any help would be appreciated!