I'm trying to standardize a dataset by Chiaretti et al. that can be loaded into a Google Colab notebook like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
!wget https://github.com/kivancguckiran/microarray-data/raw/master/csv/chiaretti.tar.gz
!tar -zxvf chiaretti.tar.gz
features = pd.read_csv("chiaretti_inputs.csv", header=None)
labels = pd.read_csv("chiaretti_outputs.csv", header=None).squeeze()
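As a quick sanity check that everything loaded correctly (I'm not assuming any particular dimensions here):

# Sanity check: features should be all-numeric, with one label per row
print(features.shape, labels.shape)
print(features.dtypes.unique())  # expect only numeric dtypes
print(labels.unique())           # the class labels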
I've used three different methods to compute z-scores, and I wonder what causes the (small) difference in results between pandas and numpy/sklearn:
n_samples, n_features = features.shape
# Method 1: sklearn's StandardScaler
standardised_sklearn = pd.DataFrame(StandardScaler().fit_transform(features))
# Method 2: numpy's mean/std applied to the DataFrame
standardised_numpy = pd.DataFrame((features - np.mean(features)) / np.std(features))
# Method 3: pandas' own mean/std
standardised_pandas = pd.DataFrame((features - features.mean()) / features.std())
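To narrow down where they diverge, here's a quick diagnostic comparing the per-feature statistics each method is built on (.head() just truncates the output):

# Per-column means and stds underlying the three methods
print(np.mean(features, axis=0).head())  # numpy mean
print(features.mean().head())            # pandas mean
print(np.std(features, axis=0).head())   # numpy std
print(features.std().head())             # pandas std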
# count how many z-scores are less than epsilon=10**-10 apart (between sklearn & numpy)
compare1 = len(np.where(np.abs(standardised_sklearn - standardised_numpy) <= 10**-10)[0])
print(compare1 / (n_samples*n_features))  # prints 1.0, i.e. 100%
# count how many z-scores are less than epsilon=10**-10 apart (between sklearn & pandas)
compare2 = len(np.where(np.abs(standardised_sklearn - standardised_pandas) <= 10**-10)[0])
print(compare2 / (n_samples*n_features))  # prints 0.0, i.e. 0%
# count how many z-scores are less than epsilon=10**-2 apart (between sklearn & pandas)
compare3 = len(np.where(np.abs(standardised_sklearn - standardised_pandas) <= 10**-2)[0])
print(compare3 / (n_samples*n_features))  # prints ~0.98, i.e. 98%
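A shorter way to quantify the same gap is the maximum element-wise deviation between the methods:

# Largest absolute difference over all entries
print(np.abs(standardised_sklearn - standardised_numpy).to_numpy().max())
print(np.abs(standardised_sklearn - standardised_pandas).to_numpy().max())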
We can see that sklearn & numpy agree almost exactly (the element-wise differences are on the order of 10**-15), but pandas is noticeably different. I'm wondering what happens "under the hood" that makes pandas' mean/std calculations so different.