
I have typical "panel data" (in the econometric sense, not the pandas Panel object). The dataframe has a Date column, an ID column, and other columns that contain certain values. For each Date, I need to cross-sectionally rank across IDs based on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) to identify the rank group. Then I want to pool all the rank1, rank2, ..., rank10 rows across time to get some stats like mean and std.

This can be accomplished easily in SAS with the following code, which also illustrates my purpose:

proc sort data=df;
    by Date;
run;

proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;

df_ranked is identical to df except that it has one more column, rank_col, which contains the rank group that each row belongs to.

Sorry, I don't have sample data to show the structure; a fairly long dataset would be needed to illustrate it. But the SAS code shows exactly what I am after.

Thanks for your help!

user3576212

2 Answers


A way I just figured out:

import pandas as pd

def grouping(data):
    # decile label (0-9) within this Date's cross-section
    dec = pd.qcut(data['V1'], 10, labels=False)
    data['ranks'] = dec
    return data

df_ranked = df.groupby('Date').apply(grouping)

This assumes dec preserves the right position for each row.
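A quick aside (mine, not from the original answer): inside apply, data is just the slice of rows for one Date, and pd.qcut produces one label per row of that slice in the same order, so writing it back lines up row by row. A minimal check, assuming the sample df shown in the edit below:

one_date = df[df['Date'] == '2013-01-01']            # one cross-section
labels = pd.qcut(one_date['V1'], 10, labels=False)   # one decile label per row
print(len(labels) == len(one_date))                  # True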

Please post if you have a better way, or point out any mistakes in this method.

Thanks!

Edit: You can just return a single new ranks column if you do something like the following:

>>> df


         Date  id  V1
0  2013-01-01   1  10
1  2013-01-01   2   8
2  2013-01-01   3   6
3  2013-01-01   4  11
4  2013-01-01   5  13
5  2013-01-01   6   4
6  2013-01-01   7   2
7  2013-02-01   1   1
8  2013-02-01   2   3
9  2013-02-01   3   9
10 2013-02-01   4  11
11 2013-02-01   5   7
12 2013-02-01   6   4
13 2013-02-01   7   6
14 2013-02-01   8  14

>>> foo = lambda x: pd.Series(pd.qcut(x,10,labels=False),index=x.index)
>>> df['ranks'] = df.groupby('Date')['V1'].apply(foo)
>>> df

         Date  id  V1  ranks
0  2013-01-01   1  10      6
1  2013-01-01   2   8      4
2  2013-01-01   3   6      3
3  2013-01-01   4  11      8
4  2013-01-01   5  13      9
5  2013-01-01   6   4      1
6  2013-01-01   7   2      0
7  2013-02-01   1   1      0
8  2013-02-01   2   3      1
9  2013-02-01   3   9      7
10 2013-02-01   4  11      8
11 2013-02-01   5   7      5
12 2013-02-01   6   4      2
13 2013-02-01   7   6      4
14 2013-02-01   8  14      9
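The remaining step from the question (pooling each decile across dates to get stats like mean and std) isn't shown above; a minimal sketch, assuming the ranks column built in this answer:

pooled = df.groupby('ranks')['V1'].agg(['mean', 'std', 'count'])  # stats per decile, pooled over all dates
print(pooled)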
Karl D.
user3576212
  • It's a reasonable way to do it. You could change the apply a little to only return a `ranks` Series. That would allow you to just assign a new `ranks` column to the original dataframe as the result of the `groupby/apply`. But your way works just fine. – Karl D. Jun 01 '14 at 02:53
  • @KarlD. Thanks for confirming. And I just found out that if I sort the dataframe by the groupby variable first, the groupby operation can be much faster. – user3576212 Jun 01 '14 at 03:26
  • @KarlD. I tried to return only the rank series `dec`, but the groupby appended the whole series to each cell. – user3576212 Jun 01 '14 at 03:36
  • If you don't mind, I can add a version that just returns a column to your answer? – Karl D. Jun 01 '14 at 03:41
  • @KarlD. Much cleaner! – user3576212 Jun 01 '14 at 03:58

It could be simpler, without needing a separate foo:

In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0     6
1     4
2     3
3     8
4     9
5     1
6     0
7     0
8     1
9     7
10    8
11    5
12    2
13    4
14    9
Name: V1, dtype: int64

Assign to a column

In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)

In [784]: df
Out[784]:
          Date  id  V1  ranks
0   2013-01-01   1  10      6
1   2013-01-01   2   8      4
2   2013-01-01   3   6      3
3   2013-01-01   4  11      8
4   2013-01-01   5  13      9
5   2013-01-01   6   4      1
6   2013-01-01   7   2      0
7   2013-02-01   1   1      0
8   2013-02-01   2   3      1
9   2013-02-01   3   9      7
10  2013-02-01   4  11      8
11  2013-02-01   5   7      5
12  2013-02-01   6   4      2
13  2013-02-01   7   6      4
14  2013-02-01   8  14      9
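An aside, not part of the original answer: the question wants decile labels 1 to 10, while pd.qcut with labels=False produces 0 to 9. If 1-based labels are needed, one option is simply to shift the transform result:

df['rank_col'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False) + 1  # deciles 1-10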

Details

In [786]: df
Out[786]:
          Date  id  V1
0   2013-01-01   1  10
1   2013-01-01   2   8
2   2013-01-01   3   6
3   2013-01-01   4  11
4   2013-01-01   5  13
5   2013-01-01   6   4
6   2013-01-01   7   2
7   2013-02-01   1   1
8   2013-02-01   2   3
9   2013-02-01   3   9
10  2013-02-01   4  11
11  2013-02-01   5   7
12  2013-02-01   6   4
13  2013-02-01   7   6
14  2013-02-01   8  14
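One more hedged note, not from the original answer: with many tied V1 values within a Date, pd.qcut can raise a "Bin edges must be unique" error. Newer pandas versions accept duplicates='drop', at the cost of possibly fewer than 10 groups for that date:

df['ranks'] = df.groupby('Date')['V1'].transform(
    lambda x: pd.qcut(x, 10, labels=False, duplicates='drop'))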
Zero