
I have a DataFrame with 2000 rows and 4000 columns (observations). I want to calculate the Spearman correlation row-wise. Currently I'm using:

df.T.corr(method="spearman")

It takes a very long time (20 minutes and still not finished).

Is there a more efficient module?

Can I preprocess the DataFrame to speed things up?

UPDATE: Using scipy.stats.spearmanr is 20x faster:

import pandas as pd
import scipy.stats as scp
SCC, pval = scp.spearmanr(df, axis=1)  # axis=1: each row is a variable
SCC = pd.DataFrame(SCC, index=df.index, columns=df.index)
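
On the preprocessing question: Spearman correlation is the Pearson correlation of the ranks, so one option is to rank each row once and then run a single vectorized Pearson computation. The sketch below is not from the original post; the helper name `rowwise_spearman` and the random demo data are made up for illustration, and it assumes the DataFrame has no missing values.

```python
import numpy as np
import pandas as pd


def rowwise_spearman(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation between rows (hypothetical helper, assumes no NaNs)."""
    # Rank the values within each row; ties receive their average rank.
    ranks = df.rank(axis=1).to_numpy()
    # np.corrcoef treats each row as one variable, so this is the
    # Pearson correlation of the ranks, i.e. the Spearman correlation.
    corr = np.corrcoef(ranks)
    return pd.DataFrame(corr, index=df.index, columns=df.index)


# Small random example standing in for the real 2000 x 4000 frame.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(5, 50)))
result = rowwise_spearman(demo)
```

On a small frame the result can be compared against `demo.T.corr(method="spearman")` before trusting it on the full data.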
Benni
  • you may want to check [this](http://stackoverflow.com/questions/25077080/calculate-special-correlation-distance-matrix-faster) – MaxU - stand with Ukraine Dec 11 '16 at 16:39
  • Probably you have answered your own question. Often when you can do something in numpy/scipy it will be significantly faster than pandas (b/c pandas is much more general, but at the cost of some processing overhead). Possibly you could get a speedup with cython or numba, but for a standard calculation like this, using the default numpy/scipy function is probably pretty close to maximum efficiency. (in the link from @maxu there are no timings so hard to say if cython buys much. I tend to doubt it, but you could certainly give it a try, or numba) – JohnE Dec 11 '16 at 18:46
  • The faster spearmanr could be wrong. Please see https://stackoverflow.com/questions/51386399/python-scipy-spearman-correlation-for-matrix-does-not-match-two-array-correlatio – Chih-Hsu Jack Lin Jul 17 '18 at 21:58
  • The problem of row-based Spearman correlation between two DataFrames is solved here: https://stackoverflow.com/questions/52371329/fast-spearman-correlation-between-two-pandas-dataframes/59072032#59072032 – The Unfun Cat Dec 02 '19 at 12:41
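
Given the concern raised in the comments that the matrix form of spearmanr might not match the two-array form, a cheap safeguard is to spot-check the matrix result against pairwise calls on a small sample. This is only a sketch on made-up random data; the array shape and variable names are assumptions, not part of the original thread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_rows = 6
data = rng.normal(size=(n_rows, 40))  # 6 variables (rows), 40 observations each

# Matrix call: with axis=1 each row is treated as one variable.
matrix_corr, _ = stats.spearmanr(data, axis=1)

# Reference: one two-array call per pair of rows.
pairwise = np.empty((n_rows, n_rows))
for i in range(n_rows):
    for j in range(n_rows):
        rho, _ = stats.spearmanr(data[i], data[j])
        pairwise[i, j] = rho

# Largest absolute disagreement between the two approaches.
max_diff = np.abs(matrix_corr - pairwise).max()
```

If `max_diff` is not close to zero, the usual suspect is the `axis` argument, i.e. whether rows or columns are being treated as variables.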
