My setup is as follows: Python 3.7, pandas 1.0.3, and scikit-learn 0.22.1. I am applying a StandardScaler to every column of a float matrix, as usual. However, the columns that come out do not have standard deviation 1, even though their means are (approximately) 0.
I am not sure what is going wrong here. I have checked whether the scaler got confused and standardised the rows instead, but that does not seem to be the case.
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

np.random.seed(1)

row_size = 5
n_obs = 100

# 5 rows x 20 columns, i.e. row_size observations per column
X = pd.DataFrame(np.random.randint(0, 1000, n_obs).reshape((row_size, int(n_obs / row_size))))

scaler = StandardScaler()
scaler.fit(X)
X_out = scaler.transform(X)
X_out = pd.DataFrame(X_out)
All columns have a standard deviation of 1.1180... rather than 1.
X_out[0].mean()
Out[2]: 4.4408920985006264e-17
X_out[0].std()
Out[3]: 1.1180339887498947
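To narrow this down, one check would be to compare the two degrees-of-freedom conventions on the same scaled column; pandas' .std() defaults to ddof=1 while NumPy's np.std defaults to ddof=0. A minimal diagnostic sketch on the X_out above:

# Compare sample vs population standard deviation of the scaled column
print(X_out[0].std())        # pandas default: sample std, ddof=1
print(X_out[0].std(ddof=0))  # population std, ddof=0
print(np.std(X_out[0]))      # np.std also uses ddof=0 by default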
EDIT:
I have realised that as I increase row_size above, e.g. from 5 to 10 and then 100, the standard deviation of the columns approaches 1. So maybe this has to do with the bias of the variance estimator shrinking as n increases(?). However, it does not make sense that I can get unit variance by manually implementing (col[i] - col[i].mean()) / col[i].std(), while the StandardScaler struggles...
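For concreteness, the manual version I mean looks like this (a minimal sketch using the X from the snippet above, with pandas' default .std()):

# Manual column-wise standardisation using pandas defaults
X_manual = (X - X.mean()) / X.std()  # pandas .std() is the sample std (ddof=1)

print(X_manual.std())  # each column reports exactly 1.0 under the same .std() call
print(X_out.std())     # compare with the StandardScaler output above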