
I am looking to efficiently calculate the Spearman correlation for a large dataset of shape (873901122, 273).

Both pandas and scipy offer implementations of this. However, their implementations produce different results.

Minimal example below:

from sklearn.datasets import load_iris
import pandas as pd, numpy as np, scipy.stats as st

# Load the iris features as a DataFrame
X, _ = load_iris(return_X_y=True, as_frame=True)

# Spearman correlation via scipy and via pandas
scipy_spearman = st.spearmanr(X)[0]
pandas_spearman = np.asarray(X.corr(method='spearman'))
diff = np.subtract(scipy_spearman, pandas_spearman)

print(np.array_equal(scipy_spearman, pandas_spearman))
print(diff)

Produces:

False
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.11022302e-16  0.00000000e+00  5.55111512e-17]
 [ 1.11022302e-16  0.00000000e+00  0.00000000e+00 -1.11022302e-16]
 [ 0.00000000e+00  5.55111512e-17 -1.11022302e-16 -1.11022302e-16]]

If they are both calculating the Spearman correlation with no randomness on the same data, why are the results different?
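For what it's worth, the differences are all on the order of double-precision machine epsilon (~2.2e-16). A tolerance-based comparison like the sketch below (continuing from the example above, using np.allclose with its default tolerances) treats the two matrices as equal:

# Sketch: check whether the two results agree within floating-point tolerance.
# np.allclose uses rtol=1e-05, atol=1e-08 by default, far looser than the
# ~1e-16 discrepancies shown above.
print(np.finfo(np.float64).eps)                      # ~2.22e-16
print(np.abs(diff).max())                            # on the order of 1e-16
print(np.allclose(scipy_spearman, pandas_spearman))  # True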
