I want to scale numerical values within groups, similar to R's scale function.

Note: by "scale" I mean the metric (x - group_mean) / group_std.

Example dataset (to demonstrate the idea):

advertiser_id   value
10              11
10              22
10              2424
11              34
11              342342
.....

Desired result:

advertiser_id   scaled_value
10              -0.58
10              -0.57
10              1.15
11              -0.707
11              0.707
.....

Following the question "implementing R scale function in pandas in Python?", I took the scale function defined there and tried to apply it per group, like this:

dt.groupby("advertiser_id").apply(scale)

but I get an error:

ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)

My original dataset has 15770 rows, and as far as I can tell the scale function should not map a single value to more than one result, so I don't see why the shapes differ.

I would appreciate some sample code or suggestions on how to modify it. Thanks!

Surah Li
  • I think in your case you can just do: `dt.groupby("advertiser_id").apply(lambda x: x / x.std())`, as that is the flavour you're after. The error comes about because that answer operates on the original df without any grouping performed – EdChum Aug 27 '15 at 22:42
  • That is not the case. The formula is `(x - x.mean()) / x.std()`. – Surah Li Aug 27 '15 at 23:01
  • So you want: `df.groupby('advertiser_id')['value'].apply(lambda x: (x- x.mean())/x.std())`? – EdChum Aug 27 '15 at 23:03
  • I tried it and it should work. But for some particular rows it doesn't return a meaningful float (it returns NaN instead). I checked my original values and noticed that they can sometimes be really small (e.g. 0.002016); I don't know whether that is why the scale function doesn't work. – Surah Li Aug 27 '15 at 23:32

1 Answer

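First, np.std behaves differently from R's sd (and from many other statistics packages) in that its delta degrees of freedom (ddof) defaults to 0, which gives the population standard deviation rather than the sample one. Passing ddof=1 reproduces R's behaviour: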

In [9]:

print(df)

   advertiser_id   value
0             10      11
1             10      22
2             10    2424
3             11      34
4             11  342342

In [10]:

print(df.groupby('advertiser_id').transform(lambda x: (x - np.mean(x)) / np.std(x, ddof=1)))

      value
0 -0.581303
1 -0.573389
2  1.154691
3 -0.707107
4  0.707107

This matches R's result.
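
To make the ddof difference concrete, here is a minimal sketch using only the three values of advertiser 10 from the example. The variable names are illustrative, not part of the original code:

```python
import numpy as np
import pandas as pd

# The three values for advertiser_id 10 from the example data.
values = pd.Series([11, 22, 2424])

# NumPy defaults to ddof=0 (population standard deviation).
pop_std = np.std(values.to_numpy())
# ddof=1 gives the sample standard deviation, which is what R's sd() uses.
sample_std = np.std(values.to_numpy(), ddof=1)
# pandas' Series.std() already defaults to ddof=1.
pandas_std = values.std()

# Scale with the sample standard deviation.
scaled = (values - values.mean()) / sample_std

print(pop_std, sample_std, pandas_std)
print(scaled.tolist())
```

Because Series.std() already uses ddof=1, `(x - x.mean()) / x.std()` inside the lambda should give the same scaled values as the np.std(x, ddof=1) version.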

Second, if any of your groups (by advertiser_id) happens to contain just 1 item, the sample standard deviation is undefined (with ddof=1 it divides by n - 1 = 0), so you will get NaN. Check whether your NaNs appear for this reason. R also returns a missing value in this case.
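
One way to check for this is sketched below. It assumes the column names from the example and adds a hypothetical advertiser 12 with a single row purely for illustration. It also flags groups whose values are all identical, since a standard deviation of 0 leads to 0/0 and therefore NaN as well:

```python
import numpy as np
import pandas as pd

# Example data from the question, plus a hypothetical single-row
# group (advertiser_id 12) added only to trigger the NaN case.
df = pd.DataFrame({
    "advertiser_id": [10, 10, 10, 11, 11, 12],
    "value": [11, 22, 2424, 34, 342342, 7],
})

grouped = df.groupby("advertiser_id")["value"]

# Per-group row count and sample standard deviation (ddof=1).
stats = grouped.agg(["count", "std"])

# Groups that cannot be scaled: fewer than 2 rows, or zero spread.
problem = stats[(stats["count"] < 2) | (stats["std"] == 0)]
print(problem)

# Scale within each group; rows in problem groups come out as NaN.
df["scaled_value"] = grouped.transform(lambda x: (x - x.mean()) / x.std())
print(df[df["scaled_value"].isna()])
```

If the NaN rows all belong to the groups listed in `problem`, the small magnitude of the values is not the cause.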

CT Zhu