
I am looking to efficiently calculate the Spearman correlation for a large dataset of shape (873901122, 273).

Both pandas and scipy offer implementations of this. However, their implementations produce different results.

Minimal example below:

from sklearn.datasets import load_iris
import pandas as pd, numpy as np, scipy.stats as st

# Load the iris features as a DataFrame
X, _ = load_iris(return_X_y=True, as_frame=True)

# Spearman correlation via scipy and via pandas
scipy_spearman = st.spearmanr(X)[0]
pandas_spearman = np.asarray(X.corr(method='spearman'))
diff = np.subtract(scipy_spearman, pandas_spearman)

print(np.array_equal(scipy_spearman, pandas_spearman))
print(diff)

Produces:

False
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.11022302e-16  0.00000000e+00  5.55111512e-17]
 [ 1.11022302e-16  0.00000000e+00  0.00000000e+00 -1.11022302e-16]
 [ 0.00000000e+00  5.55111512e-17 -1.11022302e-16 -1.11022302e-16]]

If they are both calculating the Spearman correlation with no randomness on the same data, why are the results different?
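For what it's worth, the differences are all on the order of double-precision machine epsilon (~2.2e-16). A tolerance-based comparison like the sketch below (continuing from the example above, using np.allclose with its default tolerances) treats the two matrices as equal:

# Sketch: check whether the two results agree within floating-point tolerance.
# np.allclose uses rtol=1e-05, atol=1e-08 by default, far looser than the
# ~1e-16 discrepancies shown above.
print(np.finfo(np.float64).eps)                      # ~2.22e-16
print(np.abs(diff).max())                            # on the order of 1e-16
print(np.allclose(scipy_spearman, pandas_spearman))  # True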
