I am looking to efficiently calculate the Spearman correlation for a large dataset of shape (873901122, 273).
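For context on what is being computed: Spearman's correlation is just the Pearson correlation of the column ranks. A minimal sketch of that equivalence (the helper name `spearman_via_ranks` is mine, not a library function):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_via_ranks(X):
    """Spearman correlation as Pearson correlation of column ranks.

    Illustrative sketch only; not part of pandas or scipy.
    """
    # Rank each column independently; 'average' tie handling matches
    # scipy.stats.spearmanr's default.
    ranks = np.apply_along_axis(rankdata, 0, np.asarray(X))
    # Pearson correlation of the ranks is the Spearman correlation.
    return np.corrcoef(ranks, rowvar=False)
```

On random data this agrees with `scipy.stats.spearmanr` to within floating-point tolerance.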
Both pandas and scipy offer implementations of this. However, their implementations produce different results.
Minimal example below:
import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn.datasets import load_iris

# Iris features as a DataFrame
X = load_iris(return_X_y=True, as_frame=True)[0]

# Spearman correlation matrix from each library
scipy_spearman = st.spearmanr(X)[0]
pandas_spearman = np.asarray(X.corr(method='spearman'))

diff = scipy_spearman - pandas_spearman
print(np.array_equal(scipy_spearman, pandas_spearman))
print(diff)
Produces:
False
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 -1.11022302e-16 0.00000000e+00 5.55111512e-17]
[ 1.11022302e-16 0.00000000e+00 0.00000000e+00 -1.11022302e-16]
[ 0.00000000e+00 5.55111512e-17 -1.11022302e-16 -1.11022302e-16]]
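For reference, the entries of `diff` are at the scale of double-precision machine epsilon (about 2.22e-16), and `np.array_equal` demands bitwise equality. A tolerance-based comparison, which is the usual way to compare floating-point results, treats the two matrices as equal; a minimal check:

```python
import numpy as np
import scipy.stats as st
from sklearn.datasets import load_iris

X = load_iris(return_X_y=True, as_frame=True)[0]
scipy_spearman = st.spearmanr(X)[0]
pandas_spearman = np.asarray(X.corr(method='spearman'))

# Bitwise equality is too strict for floating-point results;
# compare within a tolerance instead.
print(np.allclose(scipy_spearman, pandas_spearman))  # True
print(np.finfo(np.float64).eps)                      # ~2.22e-16
```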
If both libraries are deterministically calculating the Spearman correlation on the same data, why do the results differ?