I'm trying to standardize a dataset by Chiaretti et al. that can be loaded into a Google Colab notebook like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
!wget https://github.com/kivancguckiran/microarray-data/raw/master/csv/chiaretti.tar.gz
!tar -zxvf chiaretti.tar.gz
features = pd.read_csv("chiaretti_inputs.csv", header=None)
labels = pd.read_csv("chiaretti_outputs.csv", header=None).squeeze()
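As a quick sanity check that everything loaded correctly (I'm not assuming any particular dimensions here):

# Sanity check: features should be all-numeric, with one label per row
print(features.shape, labels.shape)
print(features.dtypes.unique())  # expect only numeric dtypes
print(labels.unique())           # the class labels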
I've used three different methods to compute z-scores, and I wonder what causes the (small) difference in results between pandas and numpy/sklearn:
n_samples, n_features = features.shape
# Method 1: sklearn's StandardScaler
standardised_sklearn = pd.DataFrame(StandardScaler().fit_transform(features))
# Method 2: numpy's mean/std applied to the DataFrame
standardised_numpy = pd.DataFrame((features - np.mean(features)) / np.std(features))
# Method 3: pandas' own mean/std
standardised_pandas = pd.DataFrame((features - features.mean()) / features.std())
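To narrow down where they diverge, here's a quick diagnostic comparing the per-feature statistics each method is built on (.head() just truncates the output):

# Per-column means and stds underlying the three methods
print(np.mean(features, axis=0).head())  # numpy mean
print(features.mean().head())            # pandas mean
print(np.std(features, axis=0).head())   # numpy std
print(features.std().head())             # pandas std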
# count how many z-scores are less than epsilon=10**-10 apart (between sklearn & numpy)
compare1 = len(np.where(np.abs(standardised_sklearn - standardised_numpy) <= 10**-10)[0])
print(compare1 / (n_samples*n_features))  # prints 1.0, i.e. 100%
# count how many z-scores are less than epsilon=10**-10 apart (between sklearn & pandas)
compare2 = len(np.where(np.abs(standardised_sklearn - standardised_pandas) <= 10**-10)[0])
print(compare2 / (n_samples*n_features))  # prints 0.0, i.e. 0%
# count how many z-scores are less than epsilon=10**-2 apart (between sklearn & pandas)
compare3 = len(np.where(np.abs(standardised_sklearn - standardised_pandas) <= 10**-2)[0])
print(compare3 / (n_samples*n_features))  # prints ~0.98, i.e. 98%
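A shorter way to quantify the same gap is the maximum element-wise deviation between the methods:

# Largest absolute difference over all entries
print(np.abs(standardised_sklearn - standardised_numpy).to_numpy().max())
print(np.abs(standardised_sklearn - standardised_pandas).to_numpy().max())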
We can see that sklearn & numpy agree almost exactly (the element-wise differences are on the order of 10**-15), but pandas is noticeably different. I'm wondering what happens "under the hood" that makes pandas' mean/std calculations so different.