0

I'm trying to normalize a pandas dataframe while grouping it based on the dates.

My dataset looks like this:

date permno ret cumret mom1m mom3m mom6m
2004-01-30 80000 0.053 1.497 0.067 0.140 0.137
2004-02-29 80000 0.053 1.497 0.067 0.140 0.137
2004-03-31 80000 0.053 1.497 0.067 0.140 0.137
2004-01-30 80001 0.053 1.497 0.067 0.140 0.137
2004-02-29 80001 0.053 1.497 0.067 0.140 0.137
2004-03-31 80001 0.053 1.497 0.067 0.140 0.137

I'm trying to scale mom1m, mom3m, mom6m based on the dates.

So the first row should be scaled with the 4th row, the second row should be scaled with the 5th row, the third row should be scaled with the last row.

What I've tried is

crsp2[scale_cols] = crsp2.groupby('date')[scale_cols].apply(lambda x: StandardScaler().fit_transform(x))

where crsp2 is the dataframe i'm trying to scale and scale_cols is the list of features I'm trying to scale.

gejyn14
  • 53
  • 6

2 Answers2

1

Thank to this answer, you can do what you want with the below example code.

from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'group':[1,1,1,1,2,2,2,2],
    'value':[1,2,3,4,5,6,9,11],
    'value2':[2,3,3,2,10,8,11,10]
})
df[['value', 'value2']] = df.groupby('group').transform(lambda x: StandardScaler().fit_transform(x.values[:,np.newaxis]).ravel())
group value value2
1 -1.34164 -1
1 -0.447214 1
1 0.447214 1
1 1.34164 -1
2 -1.15311 0.229416
2 -0.733799 -1.60591
2 0.524142 1.14708
2 1.36277 0.229416
ThSorn
  • 507
  • 4
  • 7
  • https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object here is other thread you may want. – ThSorn May 27 '22 at 17:03
0

A simpler solution could use scale() the pipelined version of the StandardScaler.

Your code would look like this:

from sklearn.preprocessing import scale

# set permno and date as multi-index
crsp2.set_index(keys =["date", "permno"],drop=True)

# columns to scale
scale_cols = ["mom1m","mom3m", "mom6m"]

# apply scaler rankwise
crsp2[scale_cols] = crsp2.groupby('date')[scale_cols].transform(lambda x: scale(x))

Output:

date    permno  ret cumret  mom1m   mom3m   mom6m
0   2004-01-30  80000   0.053   1.497   0.0 0.0 0.0
1   2004-02-29  80000   0.053   1.497   0.0 0.0 0.0
2   2004-03-31  80000   0.053   1.497   0.0 0.0 0.0
3   2004-01-30  80001   0.053   1.497   0.0 0.0 0.0
4   2004-02-29  80001   0.053   1.497   0.0 0.0 0.0
5   2004-03-31  80001   0.053   1.497   0.0 0.0 0.0
KarelZe
  • 1,466
  • 1
  • 11
  • 21