pandas groupby keeping other columns

Question

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).

I have data on firms over time. This generates some dummy data that looks like my use case:

import numpy as np
import pandas as pd

dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates,len(cols)),
                  index=np.tile(dates,nfirms),
                  columns=cols)

df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))

I need to calculate ranks of column E within each date (the index), but keeping column id.

If I just use groupby and .rank I get this:

df.groupby(level=0)['E'].rank()

1926-01-31    3226.0
1926-02-28    1042.0
1926-03-31    1611.0
1926-04-30    2591.0
1926-05-31      30.0
               ...  
2019-08-31    1973.0
2019-09-30     227.0
2019-10-31    4381.0
2019-11-30    1654.0
2019-12-31    1572.0
Name: E, Length: 5640000, dtype: float64

This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?

If the order in the output is the same as in the output, I think I can do this:

df['ranks'] = df.groupby(level=0)['E'].rank()

But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.

(I'm also not clear if calling .rank() is equivalent to .transform('rank').)

It should be safe to assign back with your code. And yes, `rank()` is pretty much the same with `transform('rank')`. This is because `rank` returns a series of the same length in each group. — Quang Hoang, Sep 01 '20 at 18:17

pandas groupby keeping other columns

0 Answers0