Detailed Example of Normalization Methods
- Pandas normalization (unbiased)
- Sklearn normalization (biased)
- Does biased-vs-unbiased affect Machine Learning?
- Min-max scaling
References:
Wikipedia: Unbiased Estimation of Standard Deviation
Example Data
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
Normalization using pandas (Gives unbiased estimates)
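To normalize (standardize) a column, subtract its mean and divide by its standard deviation. The pandas std() method defaults to ddof=1, the sample standard deviation, which divides by n - 1.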
num_cols = ['A', 'B']  # numeric columns only; 'C' holds strings
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()  # std() uses ddof=1
print(df)
A B C
0 -1.0 -1.0 a
1 0.0 0.0 b
2 1.0 1.0 c
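To see where the values -1, 0, and 1 come from, here is a minimal sketch that recomputes column A by hand with NumPy. It assumes the sample standard deviation (dividing by n - 1), which is the pandas default; the variable names are illustrative only.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # column A from the example data
n = a.size

# Sample standard deviation: sum of squared deviations divided by n - 1.
sample_std = np.sqrt(((a - a.mean()) ** 2).sum() / (n - 1))

# z-scores using the sample standard deviation, as pandas does by default.
z_sample = (a - a.mean()) / sample_std

print(sample_std)  # 1.0
print(z_sample)    # [-1.  0.  1.]
```

Column B gives the same z-scores because its values are a linear rescaling of column A.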
Normalization using sklearn (Gives biased estimates, different from pandas)
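If you standardize the same data with sklearn's StandardScaler, you get different output, because StandardScaler divides by the population standard deviation (ddof=0, dividing by n).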
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])  # fit_transform returns a NumPy float array
print(df)
A B C
0 -1.224745 -1.224745 a
1 0.000000 0.000000 b
2 1.224745 1.224745 c
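The StandardScaler values printed above can be reproduced in pandas by passing ddof=0 to std(). The following is a small sketch under that assumption; df_num and z_population are names chosen here for illustration.

```python
import pandas as pd

df_num = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500]})

# ddof=0 divides by n (the population formula) instead of n - 1.
z_population = (df_num - df_num.mean()) / df_num.std(ddof=0)

print(z_population.round(6))
```

The first and last rows come out as roughly -1.224745 and 1.224745, the same magnitudes that StandardScaler produced.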
Do the biased estimates of sklearn make Machine Learning less powerful?
NO.
The official documentation of sklearn.preprocessing.scale states that the choice of ddof is unlikely to affect model performance, so the biased estimator is safe to use.
From the official documentation:
"We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance."
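One way to see why the choice matters so little: for a column of n values, the ddof=1 and ddof=0 standard deviations differ only by the constant factor sqrt(n / (n - 1)), so the two sets of z-scores are the same up to a uniform rescaling. The sketch below prints that factor for a few sample sizes; ddof_ratio is a hypothetical helper name, not part of pandas or sklearn.

```python
import math


def ddof_ratio(n: int) -> float:
    """Ratio of the ddof=1 standard deviation to the ddof=0 one for n samples."""
    return math.sqrt(n / (n - 1))


# The factor approaches 1 as the number of samples grows.
for n in (3, 100, 10_000):
    print(n, round(ddof_ratio(n), 6))
```

For n = 3 the factor is about 1.224745, which is exactly the gap between the pandas and sklearn outputs in the example; for realistic dataset sizes it is very close to 1.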
What about MinMax Scaling?
Min-max scaling does not involve the standard deviation, so pandas and scikit-learn give the same result.
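For reference, the usual min-max formula applied to each column, with a worked value for the middle entry of column B (300, with column minimum 100 and maximum 500), can be written as:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},
\qquad
\frac{300 - 100}{500 - 100} = \frac{200}{400} = 0.5
```

Only the column minimum and maximum appear, so there is no ddof choice to make.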
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
})
print((df - df.min()) / (df.max() - df.min()))
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
arr_scaled = scaler.fit_transform(df)
print(arr_scaled)
[[0. 0. ]
[0.5 0.5]
[1. 1. ]]
df_scaled = pd.DataFrame(arr_scaled, columns=df.columns, index=df.index)
print(df_scaled)
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0