scale numerical values for different groups in python

Question

I want to scale numerical values (similar to R's scale function) within different groups.

Note: by "scale" I mean this metric: (x - group_mean) / group_std

Example dataset (for demonstration only):

advertiser_id   value
10              11
10              22
10              2424
11              34
11              342342
.....

Desired results:

advertiser_id   scaled_value
10              -0.58
10              -0.57
10              1.15
11              -0.707
11              0.707
.....

Following the question "implementing R scale function in pandas in Python?", I took the scale function defined there and tried to apply it per group like this:

dt.groupby("advertiser_id").apply(scale)

but I get an error:

ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)

My original dataset has 15770 rows, and I don't think the scale function maps a single value to more than 2 results (in this case).

I would appreciate some sample code or suggestions on how to modify it. Thanks!

I think in your case you can just do: `dt.groupby("advertiser_id").apply(lambda x: x / x.std())` as that is the flavour you're after. The error comes about because the code in that answer operates on the original df without any grouping performed. – EdChum, Aug 27 '15 at 22:42
That is not the case. The formula is `(x - x.mean()) / x.std()`. – Surah Li, Aug 27 '15 at 23:01
So you want: `df.groupby('advertiser_id')['value'].apply(lambda x: (x - x.mean()) / x.std())`? – EdChum, Aug 27 '15 at 23:03
I tried it and it should work. But for some particular rows it doesn't return a meaningful float; it returns NaN instead. I checked my original values and noticed that they can sometimes be really small (e.g. 0.002016). I don't know whether that is why the scale function fails. – Surah Li, Aug 27 '15 at 23:32

score 1 · Accepted Answer · answered Aug 28 '15 at 02:59

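First, np.std behaves differently from the standard deviation functions in most other languages (including R's sd) in that its delta degrees of freedom, ddof, defaults to 0. To get the sample standard deviation that R uses, pass ddof=1: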

In [9]:

import numpy as np  # needed for np.mean / np.std below

print(df)

   advertiser_id   value
0             10      11
1             10      22
2             10    2424
3             11      34
4             11  342342

In [10]:

print(df.groupby('advertiser_id').transform(lambda x: (x - np.mean(x)) / np.std(x, ddof=1)))

      value
0 -0.581303
1 -0.573389
2  1.154691
3 -0.707107
4  0.707107

This matches R's result.
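
For anyone who wants to run this end to end, here is a minimal self-contained sketch in Python 3. It rebuilds the five example rows from the question and compares ddof=1 with ddof=0. The helper name scale_by_group is made up for illustration and is not part of pandas.

```python
import pandas as pd

# The five example rows from the question.
df = pd.DataFrame({
    "advertiser_id": [10, 10, 10, 11, 11],
    "value": [11, 22, 2424, 34, 342342],
})


def scale_by_group(frame, key, col, ddof=1):
    """Hypothetical helper: (x - group_mean) / group_std within each group."""
    return frame.groupby(key)[col].transform(
        lambda s: (s - s.mean()) / s.std(ddof=ddof)
    )


# ddof=1 is the sample standard deviation, which is what R's scale uses.
scaled_sample = scale_by_group(df, "advertiser_id", "value", ddof=1)

# ddof=0 is np.std's default (population standard deviation).
scaled_population = scale_by_group(df, "advertiser_id", "value", ddof=0)

print(pd.DataFrame({
    "advertiser_id": df["advertiser_id"],
    "ddof=1": scaled_sample,
    "ddof=0": scaled_population,
}))
```

The ddof=1 column should reproduce the values shown above, while the ddof=0 column is larger in magnitude (for the two-row group it is exactly -1 and 1).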

Second, if any of your groups (by advertiser_id) happens to contain just one item, its standard deviation is undefined with ddof=1 (and 0 with ddof=0), so the scaled value is NaN either way. Check whether this is why you get NaN. R would return NaN in this case as well.
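
One way to check for this is sketched below. The extra row for advertiser 12 is invented purely to demonstrate a single-item group; it is not from the question's data.

```python
import pandas as pd

# Example rows from the question plus one made-up single-row group (id 12).
df = pd.DataFrame({
    "advertiser_id": [10, 10, 10, 11, 11, 12],
    "value": [11, 22, 2424, 34, 342342, 5],
})

# Number of rows in each row's group.
group_sizes = df["advertiser_id"].map(df["advertiser_id"].value_counts())

# Rows whose group has fewer than two items cannot be scaled.
singletons = df[group_sizes < 2]
print(singletons)

scaled = df.groupby("advertiser_id")["value"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=1)
)

# The NaN rows should line up with the single-item groups.
print(df[scaled.isna()])
```

If the NaN rows are not all single-item groups, another likely cause is a group whose values are all identical, since its standard deviation is 0 and the division becomes 0/0.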
