This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum
, but I need to use rank
(so the difference is like that between agg
and transform
).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates,len(cols)),
index=np.tile(dates,nfirms),
columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E
within each date (the index), but keeping column id
.
If I just use groupby and .rank
I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df
but I'm not sure it's safe to merge on the index — I really need to join on the id
column also. Can I assume that the order remains the same?
If the order in the output is the same as in the output, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank()
is equivalent to .transform('rank')
.)