
My setup is as follows: Python 3.7, pandas 1.0.3, and sklearn 0.22.1. I am applying a StandardScaler to every column of a float matrix, as usual. However, the columns that come out do not have a standard deviation of 1, although their means are (approximately) 0.

I am not sure what is going wrong here. I have checked whether the scaler got confused and standardised the rows instead, but that does not seem to be the case.

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
np.random.seed(1)
row_size = 5
n_obs = 100
# reshape to row_size x (n_obs / row_size), i.e. 5 rows x 20 columns
X = pd.DataFrame(np.random.randint(0, 1000, n_obs).reshape((row_size, int(n_obs / row_size))))

scaler = StandardScaler()
scaler.fit(X)
X_out = scaler.transform(X)
X_out = pd.DataFrame(X_out)

All columns have a standard deviation of 1.1180... instead of 1.

X_out[0].mean()
>>Out[2]: 4.4408920985006264e-17
X_out[0].std()
>>Out[3]: 1.1180339887498947

EDIT: I have realised that as I increase row_size above, e.g. from 5 to 10 and 100, the standard deviation of the columns approaches 1. So maybe this has to do with the bias of the variance estimator getting smaller as n increases(?). However, it does not make sense that I can get unit variance by manually implementing (col[i] - col[i].mean()) / col[i].std(), yet the StandardScaler struggles...
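For reference, a quick sanity check (reusing row_size from the code above) shows that the observed value is exactly sqrt(n/(n-1)) with n = 5, i.e. the ratio between the unbiased and biased standard deviation estimates:

import numpy as np
n = 5  # observations per column (row_size above)
print(np.sqrt(n / (n - 1)))  # 1.118033988749895, matching the column std reported by pandas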

  • 1. I can reproduce this with your code with pandas. 2. With pure numpy arrays I get std=1. 3. In the past I often had problems with pandas and sklearn, but I cannot tell you exactly what the problem is. It can be circumvented by passing the np.ndarray `X.values` to sklearn and wrapping the result back in a pd.DataFrame. – Niklas Mertsch Jun 20 '20 at 08:09
  • @NiklasMertsch Thank you, let me try that. I checked whether it may be getting confused by the fact that the row names overlap with the column names, but that doesn't seem to be the case either. This may be a bug, I suppose. – Zhubarb Jun 20 '20 at 08:10
  • @NiklasMertsch Passing in `df.values` (numpy.ndarray) gets me the same output with the non-zero standard deviation again... – Zhubarb Jun 20 '20 at 08:15

1 Answer


Numpy and pandas use different default definitions of the standard deviation (biased, ddof=0, vs. unbiased, ddof=1). Sklearn uses the numpy definition, so scaler.transform(X).std(axis=0), computed on the raw numpy output, does give 1 for every column.

But then you wrap the standardized values X_out in a pandas DataFrame and ask pandas for the standard deviation of the same values. Pandas' std() defaults to the unbiased estimator (ddof=1), and with only row_size = 5 observations per column the two estimates differ by a factor of sqrt(5/4) ≈ 1.118, which is exactly what you observe.
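A minimal sketch of the difference (using a toy series, not the question's data): numpy's std defaults to ddof=0, pandas' Series.std defaults to ddof=1, and they agree once ddof is aligned.

import numpy as np
import pandas as pd

vals = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.std(vals.values))   # 1.414..., population std (ddof=0), what sklearn matches
print(vals.std())            # 1.581..., sample std (ddof=1), the pandas default
print(vals.std(ddof=0))      # 1.414..., same as numpy once ddof is aligned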

In most cases you only care that all columns have the same spread, so the difference does not matter. But if you really want the unbiased standard deviation, you cannot get it from sklearn's StandardScaler; a manual alternative is sketched below.
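If you do want columns with unit standard deviation under the unbiased (ddof=1) estimator, a minimal sketch (assuming the X DataFrame from the question) is to standardize manually with pandas, whose mean() and std() operate column-wise by default:

X_unbiased = (X - X.mean()) / X.std()  # pandas std() uses ddof=1
print(X_unbiased.std())                # 1.0 for every column, measured with the unbiased estimator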

Niklas Mertsch
  • Ah, good point! I sensed it had to do with biases, and the numpy vs. pandas explanation makes sense. – Zhubarb Jun 20 '20 at 13:54