0

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).

I have data on firms over time. This generates some dummy data that looks like my use case:

import numpy as np
import pandas as pd

dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates,len(cols)),
                  index=np.tile(dates,nfirms),
                  columns=cols)

df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))

I need to calculate ranks of column E within each date (the index), but keeping column id.

If I just use groupby and .rank I get this:

df.groupby(level=0)['E'].rank()

1926-01-31    3226.0
1926-02-28    1042.0
1926-03-31    1611.0
1926-04-30    2591.0
1926-05-31      30.0
               ...  
2019-08-31    1973.0
2019-09-30     227.0
2019-10-31    4381.0
2019-11-30    1654.0
2019-12-31    1572.0
Name: E, Length: 5640000, dtype: float64

This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?

If the order in the output is the same as in the output, I think I can do this:

df['ranks'] = df.groupby(level=0)['E'].rank()

But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.

(I'm also not clear if calling .rank() is equivalent to .transform('rank').)

itzy
  • 11,275
  • 15
  • 63
  • 96

0 Answers0