Can anyone explain the math behind the scenes? Why do Python and R return different results, and which one should I use for a real-world business scenario?

Original data:

id  cost    sales   item
1   300      50     pen
2   3        88     wf
3   1        70     gher
4   5        80     dger
5   2        999    ww

Python code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Scale.csv')
# StandardScaler z-scores each column: subtract the mean, divide by the standard deviation
df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
df

Python standardized result:

    id     cost        sales    item
0   1   1.999876    -0.559003   pen
1   2   -0.497867   -0.456582   wf
2   3   -0.514686   -0.505097   gher
3   4   -0.481047   -0.478144   dger
4   5   -0.506276   1.998826    ww

and the R code:

library(readr)
library(dplyr)

df <- read_csv("C:/Users/Ho/Desktop/Scale.csv")
# z-score both columns; scale() returns a one-column matrix, so coerce back to a vector
df <- df %>% mutate(across(c(cost, sales), ~ as.vector(scale(.x))))

R standardized result:

   id   cost        sales      item
1   1   1.7887437  -0.4999873  pen
2   2  -0.4453054  -0.4083792  wf
3   3  -0.4603495  -0.4517725  gher
4   4  -0.4302613  -0.4276651  dger
5   5  -0.4528275   1.7878041  ww

thanks @Wen

BigData

1 Answer

I don't use those Python functions much, but the numbers imply that the difference is the denominator used for the standard deviation: scikit-learn's `StandardScaler` divides by `n` (the population formula), while R's `scale()` divides by `n-1` (the sample formula). We can convert between the two by multiplying by a constant, and the following shows that after multiplying the R output by sqrt(5/4) it matches the Python values.

> tab <- read.table(textConnection("1   1   1.7887437   -0.4999873  pen
+ 2   2   -0.4453054  -0.4083792  wf
+ 3   3   -0.4603495  -0.4517725  gher
+ 4   4   -0.4302613  -0.4276651  dger
+ 5   5   -0.4528275  1.7878041   ww"))
> tab
  V1 V2                   V3                   V4   V5
1  1  1  1.78874369999999994 -0.49998730000000002  pen
2  2  2 -0.44530540000000002 -0.40837920000000000   wf
3  3  3 -0.46034950000000002 -0.45177250000000002 gher
4  4  4 -0.43026130000000001 -0.42766510000000002 dger
5  5  5 -0.45282749999999999  1.78780410000000001   ww
> # To transform as if we used n in the denominator instead of
> # n-1 we just multiply by sqrt(n/(n-1))
> tab$V3 * sqrt(5/4)
[1]  1.99987625376224520 -0.49786657257386746 -0.51468638770401975
[4] -0.48104675744371517 -0.50627653604064304
> tab$V4 * sqrt(5/4)
[1] -0.55900279534329034 -0.45658182589849106 -0.50509701018251196
[4] -0.47814411760212272  1.99882574902641608
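
For reference, here is a minimal Python sketch of the same check, using the `cost` column from the question and assuming only NumPy: switching `ddof` between 0 and 1 reproduces the scikit-learn and R outputs respectively.

import numpy as np

# cost column from the question's data
cost = np.array([300, 3, 1, 5, 2], dtype=float)
n = len(cost)

# scikit-learn's StandardScaler divides by the population std (ddof=0)
z_pop = (cost - cost.mean()) / cost.std(ddof=0)
# R's scale() divides by the sample std (ddof=1)
z_sample = (cost - cost.mean()) / cost.std(ddof=1)

print(z_pop)     # ~ [ 1.999876 -0.497867 -0.514686 -0.481047 -0.506276]
print(z_sample)  # ~ [ 1.788744 -0.445305 -0.460350 -0.430261 -0.452828]

# the two differ only by the constant factor sqrt(n / (n - 1))
print(np.allclose(z_sample * np.sqrt(n / (n - 1)), z_pop))  # True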
Dason
  • Actually, what @joran pointed out in jest is somewhat right: since the sample size is really small, using `n-1` is statistically more robust. – M-- Apr 04 '18 at 20:15
  • 2
  • That's OK, 'cause Python is only for BIIIIiiiiG data. – Stephen Henderson Apr 04 '18 at 20:17
  • Which one should I use for a real-world business scenario? – BigData Apr 04 '18 at 20:43
  • @BigData It most likely doesn't matter. The difference amounts to a rounding error once the dataset is large enough to justify a machine learning algorithm in a real application, which is presumably why you want `scikit-learn` or anything similar in R in the first place. – ngm Apr 04 '18 at 20:47
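
To make that last point concrete, the two conventions differ only by the factor sqrt(n / (n - 1)), which is noticeable at n = 5 but vanishes quickly as the sample grows (a quick sketch):

import math

# the correction factor between the n and n-1 conventions shrinks toward 1 as n grows
for n in (5, 100, 100_000):
    print(n, math.sqrt(n / (n - 1)))   # 5 -> ~1.1180, 100 -> ~1.0050, 100000 -> ~1.0000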