Pandas array of pairwise correlations between columns

Question

I want to run a Spearman correlation of each column vs all the other columns in pandas. I need only the distribution of correlations (array), not a correlation matrix.

I know that I could use df.corr(method='spearman'), but I need only the pairwise correlations, not the entire correlation matrix or its diagonal. I think this may speed up the computation, since I would be computing only (N^2 - N)/2 correlations instead of N^2.

However, this is just an assumption - since the matrix would be symmetric, maybe pandas already works by computing one half of the correlation matrix and then filling the rest accordingly.
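As a quick sanity check on the pair count above, here is a minimal sketch. The helper name `n_pairs` and the example column labels are hypothetical, chosen only for illustration; they are not part of pandas.

```python
from itertools import combinations


def n_pairs(n: int) -> int:
    """Number of unordered pairs of distinct columns: (n^2 - n) / 2."""
    return (n * n - n) // 2


# Hypothetical column labels, used only to enumerate the pairs.
columns = ["a", "b", "c", "d"]

# Every unordered pair of distinct columns, i.e. the strict upper triangle
# of the correlation matrix.
pairs = list(combinations(columns, 2))
```

For four columns this enumerates the same pairs that the nested loop in the question visits.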

So far, my (very inefficient) solution is:

import pandas as pd
import scipy.stats as ss

# d is a pandas DataFrame

corr_a = []
for i, col1 in enumerate(d.columns):
    for col2 in d.columns[i+1:]:
        # d[col] selects a column; d.loc[col] would look up a row label instead
        r, _ = ss.spearmanr(d[col1], d[col2])
        corr_a.append(r)

Is there a built-in or vectorized API to run this faster?

Almost any pandas built-in method is faster than anything you try to implement using Python loops. `df.corr()` is as fast as you can get. — DYZ, Jan 22 '18 at 22:23
Agree with @DYZ. If you look at the source code of the `corr` function, they are already optimizing the calculation. — Zhiya, Jan 22 '18 at 22:25
Yes, mine was kind of an obvious statement. I think maybe the best solution is to keep only one triangle of the matrix (such as here: https://stackoverflow.com/a/34418376/41977). I think this may be a duplicate. — gc5, Jan 22 '18 at 22:27

gc5 · Answer 1 · 2018-01-23T15:25:50.413

The pandas solution was actually easier than I thought:

import numpy as np
import pandas as pd

# d is a pandas DataFrame
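corr_matrix = d.corr(method='spearman')
# Boolean mask of the strict upper triangle (k=1 excludes the diagonal).
# The builtin bool replaces np.bool, which was removed from recent NumPy releases.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
# where() sets the masked-out entries to NaN; dropna() then removes them
pairs = corr_matrix.where(mask).stack().dropna().reset_index()
corr = pairs.iloc[:, 2]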

Feel free to edit if you can provide a way to compute only half of the correlation matrix (my original matrix is high-dimensional, so the computational cost of this solution may explode).
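One possible direction, offered as a sketch rather than a benchmarked optimization: Spearman's coefficient is the Pearson coefficient computed on ranks, so the columns can be ranked once and passed to `np.corrcoef`. This still builds the full matrix internally, but it avoids Python-level loops and returns only the strict upper-triangle values. The function name `pairwise_spearman` is hypothetical, and the sketch assumes the DataFrame has no missing values (unlike `DataFrame.corr`, it does no pairwise NaN handling).

```python
import numpy as np
import pandas as pd


def pairwise_spearman(df: pd.DataFrame) -> np.ndarray:
    """Return the N*(N-1)/2 Spearman correlations between distinct columns.

    Assumes df contains no missing values.
    """
    # Rank each column; ties get the average rank, as in Spearman's definition.
    ranks = df.rank(method="average").to_numpy()
    # Pearson correlation of the ranks; rowvar=False treats columns as variables.
    corr = np.corrcoef(ranks, rowvar=False)
    # Indices of the strict upper triangle (k=1 skips the diagonal).
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]
```

The values come out in row-major order of the upper triangle, so the first entry is the correlation between the first and second columns.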
