I have the following code, which scales data and then converts it to a DataFrame. It turns out that the conversion step changes values in the array, even though both types are the same (float64).

from sklearn.preprocessing import StandardScaler as SKLStandardScaler
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(10)})
scaler = SKLStandardScaler()
scaled = scaler.fit_transform(df)  # ndarray of float64
df2 = pd.DataFrame(scaled)         # wrap the array back into a DataFrame

print(scaled.mean(), df2.mean().values)
print(scaled.std(), df2.std().values)
print(scaled.dtype, df2[0].dtypes)

This prints:

-6.661338147750939e-17 [-1.11022302e-16]
1.0 [1.05409255]
float64 float64

So I calculated the mean and std from the array and from the DataFrame. Even though both are computed from the same data, with the same type (so no floating-point conversion error), the values differ. Why is that?

amonowy
  • While I can't reproduce the difference in the mean, the difference in `std` is because `numpy` and `pandas` use a different default `ddof`. Try `scaled.std(ddof=1)` or `df2.std(ddof=0)` – Chris Apr 01 '21 at 05:55 (see the sketch after these comments)
  • 1
    thanks, it explains difference in std perfectly :) – amonowy Apr 01 '21 at 05:58
  • Conversion to a dataframe does not actually change anything; if you run `df2.values` you'll see the same underlying array. How the mean is computed differs, though. See: https://stackoverflow.com/questions/53042250 – AlexK Apr 01 '21 at 06:07
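
Putting the two comments together, here is a minimal sketch. The `np.shares_memory` check and the `1e-15` tolerance are additions for illustration, and whether the buffer is actually shared depends on the pandas version and copy-on-write settings:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler as SKLStandardScaler

df = pd.DataFrame({"a": range(10)})
scaled = SKLStandardScaler().fit_transform(df)
df2 = pd.DataFrame(scaled)

# The DataFrame typically wraps the same float64 buffer, so no values were
# changed (memory sharing depends on pandas version / copy-on-write).
print(np.shares_memory(scaled, df2.values))

# std: numpy defaults to ddof=0 (population), pandas to ddof=1 (sample).
# With matching ddof the two results agree.
print(np.isclose(scaled.std(ddof=1), df2.std().values[0]))   # True
print(np.isclose(scaled.std(), df2.std(ddof=0).values[0]))   # True

# mean: both values are on the order of 1e-16, i.e. zero up to float64
# rounding; numpy and pandas merely accumulate the sum in a different order,
# so the rounding error lands differently.
print(abs(scaled.mean()) < 1e-15, abs(df2.mean().values[0]) < 1e-15)  # True True

In other words, neither library changed the data; they only apply different defaults (`ddof`) and different summation orders when computing the statistics.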

0 Answers