Can anyone explain the math behind the scenes? Why do Python and R return different results, and which one should I use for a real-world business scenario?

Original data:

id  cost    sales   item
1   300      50     pen
2   3        88     wf
3   1        70     gher
4   5        80     dger
5   2        999    ww

Python code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Scale.csv')
# StandardScaler z-scores each column: subtract the mean, divide by the standard deviation
df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
df

Python standardized result:

    id     cost        sales    item
0   1   1.999876    -0.559003   pen
1   2   -0.497867   -0.456582   wf
2   3   -0.514686   -0.505097   gher
3   4   -0.481047   -0.478144   dger
4   5   -0.506276   1.998826    ww

and the R code:

library(readr)
library(dplyr)

df <- read_csv("C:/Users/Ho/Desktop/Scale.csv")
# z-score both columns; scale() returns a one-column matrix, so coerce back to a vector
df <- df %>% mutate(across(c(cost, sales), ~ as.vector(scale(.x))))

R standardized result:

   id   cost        sales      item
1   1   1.7887437  -0.4999873  pen
2   2  -0.4453054  -0.4083792  wf
3   3  -0.4603495  -0.4517725  gher
4   4  -0.4302613  -0.4276651  dger
5   5  -0.4528275   1.7878041  ww

thanks @Wen

BigData

1 Answer

I don't use those Python functions much, but the numbers imply that the difference is the denominator used for the standard deviation: scikit-learn's `StandardScaler` divides by `n` (the population formula), while R's `scale()` divides by `n-1` (the sample formula). We can convert between the two by multiplying by a constant, and the following shows that after multiplying the R output by sqrt(5/4) it matches the Python values.

> tab <- read.table(textConnection("1   1   1.7887437   -0.4999873  pen
+ 2   2   -0.4453054  -0.4083792  wf
+ 3   3   -0.4603495  -0.4517725  gher
+ 4   4   -0.4302613  -0.4276651  dger
+ 5   5   -0.4528275  1.7878041   ww"))
> tab
  V1 V2                   V3                   V4   V5
1  1  1  1.78874369999999994 -0.49998730000000002  pen
2  2  2 -0.44530540000000002 -0.40837920000000000   wf
3  3  3 -0.46034950000000002 -0.45177250000000002 gher
4  4  4 -0.43026130000000001 -0.42766510000000002 dger
5  5  5 -0.45282749999999999  1.78780410000000001   ww
> # To transform as if we used n in the denominator instead of
> # n-1 we just multiply by sqrt(n/(n-1))
> tab$V3 * sqrt(5/4)
[1]  1.99987625376224520 -0.49786657257386746 -0.51468638770401975
[4] -0.48104675744371517 -0.50627653604064304
> tab$V4 * sqrt(5/4)
[1] -0.55900279534329034 -0.45658182589849106 -0.50509701018251196
[4] -0.47814411760212272  1.99882574902641608
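
For reference, here is a minimal Python sketch of the same check, using the `cost` column from the question and assuming only NumPy: switching `ddof` between 0 and 1 reproduces the scikit-learn and R outputs respectively.

import numpy as np

# cost column from the question's data
cost = np.array([300, 3, 1, 5, 2], dtype=float)
n = len(cost)

# scikit-learn's StandardScaler divides by the population std (ddof=0)
z_pop = (cost - cost.mean()) / cost.std(ddof=0)
# R's scale() divides by the sample std (ddof=1)
z_sample = (cost - cost.mean()) / cost.std(ddof=1)

print(z_pop)     # ~ [ 1.999876 -0.497867 -0.514686 -0.481047 -0.506276]
print(z_sample)  # ~ [ 1.788744 -0.445305 -0.460350 -0.430261 -0.452828]

# the two differ only by the constant factor sqrt(n / (n - 1))
print(np.allclose(z_sample * np.sqrt(n / (n - 1)), z_pop))  # True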
Dason
  • Actually, what @joran pointed out in jest is somewhat right: since the sample size is really small, using `n-1` is statistically more robust. – M-- Apr 04 '18 at 20:15
  • 2
  • That's OK, 'cause Python is only for BIIIIiiiiG data. – Stephen Henderson Apr 04 '18 at 20:17
  • Which one should I use for a real-world business scenario? – BigData Apr 04 '18 at 20:43
  • @BigData It most likely doesn't matter. The difference amounts to a rounding error once the dataset is large enough to justify a machine learning algorithm in a real application, which is presumably why you want `scikit-learn` or anything similar in R in the first place. – ngm Apr 04 '18 at 20:47
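
To make that last point concrete, the two conventions differ only by the factor sqrt(n / (n - 1)), which is noticeable at n = 5 but vanishes quickly as the sample grows (a quick sketch):

import math

# the correction factor between the n and n-1 conventions shrinks toward 1 as n grows
for n in (5, 100, 100_000):
    print(n, math.sqrt(n / (n - 1)))   # 5 -> ~1.1180, 100 -> ~1.0050, 100000 -> ~1.0000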