Fast spearman correlation between two pandas dataframes

Question

I want to apply spearman correlation to two pandas dataframes with the same number of columns (correlation of each pair of rows).

My objective is to compute the distribution of spearman correlations between each pair of rows (r, s) where r is a row from the first dataframe and s is a row from the second dataframe.

I am aware that similar questions have been answered before (see this). However, this question differs because I want to compare each row of first dataframe with ALL the rows in the second. Additionally, this is computationally intensive and it takes hours due to the size of my data. I want to parallelize it and possibly to rewrite it in order to speed it up.

I tried with numba but unfortunately it fails (similar issue to this) because it seems to not recognize scipy spearmanr. My code is the following:

def corr(a, b):
    dist = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
    return dist

piRSquared · Accepted Answer · 2018-09-17T21:12:59.803

NEW ANSWER

from numba import njit
import pandas as pd
import numpy as np

@njit
def mean1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].mean()
  return b

@njit
def std1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].std()
  return b

@njit
def c(a, b):
    ''' Correlation '''
    n, k = a.shape
    m, k = b.shape

    mu_a = mean1(a)
    mu_b = mean1(b)
    sig_a = std1(a)
    sig_b = std1(b)

    out = np.empty((n, m))

    for i in range(n):
        for j in range(m):
            out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]

    return out

r = df_test.rank(1).values
df_test.T.corr('spearman') == c(r, r)

OLD ANSWER

Doing a Spearman Rank correlation is simply doing a correlation of the ranks.

Rank

We can leverage argsort to get ranks. Though the argsort of the argsort does get us the ranks, we can limit ourselves to one sort by slice assigning.

def rank(a):
  i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')

  s = a.argsort(1)
  out = np.empty_like(s)
  out[i, s] = j

  return out

Correlation

In the case of correlating ranks, the means and standard deviations are all predetermined by the size of the second dimension of the array.

You can accomplish this same thing without numba, but I'm assuming you want it.

from numba import njit

@njit
def c(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  out = np.empty((n, m))

  a = a - mu
  b = b - mu

  for i in range(n):
    for j in range(m):
      out[i, j] = a[i] @ b[j] / k / sig ** 2

  return out

For posterity, we could avoid the internal loop altogether but this might have memory issues.

@njit
def c1(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  a = a - mu
  b = b - mu

  return a @ b.T / k / sig ** 2

Demonstration

np.random.seed([3, 1415])

a = np.random.randn(2, 10)
b = np.random.randn(2, 10)

rank_a = rank(a)
rank_b = rank(b)

c(rank_a, rank_b)

array([[0.32121212, 0.01818182],
       [0.13939394, 0.55151515]])

If you were working with DataFrame

da = pd.DataFrame(a)
db = pd.DataFrame(b)

pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)


          0         1
0  0.321212  0.018182
1  0.139394  0.551515

Validation

We can do a quick validation using pandas.DataFrame.corr

pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)

      0     1
0  True  True
1  True  True

I'm sorry but I didn't get the same results using the two functions. I did a basic correlation of the same matrix (3x3) and then subtracted the two results. The diagonal is correct, other coefficients are different than 0 - I think the problem is in the `c()` function since the `rank()` is working correctly. Thanks — gc5, Sep 17 '18 at 19:49
Can you explain more? Show the data you used, steps you took, and results? — piRSquared, Sep 17 '18 at 19:55
Yes, sorry - the comment space was very short. Here's a pastebin: https://pastebin.com/M5mFz8FJ — gc5, Sep 17 '18 at 20:04
I have also to add that I'm not biased towards numba, if it's faster ok. But I want the fastest methodology. — gc5, Sep 17 '18 at 20:10
I get all zeros. Can you edit that paste bin to include an [mcve]. Meaning, include the data. That way, I can copy and paste directly from the paste bin and run it. — piRSquared, Sep 17 '18 at 20:15
I can't access dropbox from where I'm at. I'm talking about providing a small sample of data within your paste bin just so we can make sure we're using the same data as we debug. — piRSquared, Sep 17 '18 at 20:27
This should work https://pastebin.com/6f1j5Ybn . Let me know if it's ok. Thanks. I noticed that it may be related to ties resolution in the ranks.. maybe your code give different tie values? Since the data is very sparse and I expect a lot of ties. — gc5, Sep 17 '18 at 20:37
Ahh! I see now. That's a big problem. I didn't anticipate a sparse type array. Meaning, I didn't expect lot's of zeroes. I have to rethink. — piRSquared, Sep 17 '18 at 20:46
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180229/discussion-between-gc5-and-pirsquared). — gc5, Sep 17 '18 at 20:52
This works, but I question its performance. See if that helps you out. — piRSquared, Sep 17 '18 at 21:13

The Unfun Cat · Answer 2 · 2019-11-28T08:26:27.473

Here is a rowbased, uncompiled version of scipy.stats.spearmanr that uses ~ 5% of the time on large datasets with an example showing it produces the same result:

import numpy as np

import pandas as pd


def spearman_row(x, y):

    x = np.asarray(x)
    y = np.asarray(y)

    rx = rankdata_average(x)
    ry = rankdata_average(y)

    # print(rx)
    # print(ry)

    return compute_corr(rx, ry)

def compute_corr(x, y):

    # Thanks to https://github.com/dengemann

    def ss(a, axis):
        return np.sum(a * a, axis=axis)

    x = np.asarray(x)
    y = np.asarray(y)

    mx = x.mean(axis=-1)
    my = y.mean(axis=-1)

    xm, ym = x - mx[..., None], y - my[..., None]

    r_num = np.add.reduce(xm * ym, axis=-1)
    r_den = np.sqrt(ss(xm, axis=-1) * ss(ym, axis=-1))

    with np.errstate(divide='ignore', invalid="ignore"):

        r = r_num / r_den

    return r


def rankdata_average(data):

    """Row-based rankdata using method=mean"""

    dc = np.asarray(data).copy()
    sorter = np.apply_along_axis(np.argsort, 1, data)

    inv = np.empty(data.shape, np.intp)

    ranks = np.tile(np.arange(data.shape[1]), (len(data), 1))

    np.put_along_axis(inv, sorter, ranks, axis=1)

    dc = np.take_along_axis(dc, sorter, 1)

    res = np.apply_along_axis(lambda r: r[1:] != r[:-1], 1, dc)

    obs = np.column_stack([np.ones(len(res), dtype=bool), res])

    dense = np.take_along_axis(np.apply_along_axis(np.cumsum, 1, obs), inv, 1)

    len_r = obs.shape[1]

    nonzero = np.count_nonzero(obs, axis=1)
    obs = pd.DataFrame(obs)
    nonzero = pd.Series(nonzero)
    dense = pd.DataFrame(dense)

    ranks = []
    for _nonzero, nzdf in obs.groupby(nonzero, sort=False):

        nz = np.apply_along_axis(lambda r: np.nonzero(r)[0], 1, nzdf)

        _count = np.column_stack([nz, np.ones(len(nz)) * len_r])
        _dense = dense.reindex(nzdf.index).values

        _result = 0.5 * (np.take_along_axis(_count, _dense, 1) + np.take_along_axis(_count, _dense - 1, 1) + 1)

        result = pd.DataFrame(_result, index=nzdf.index)
        ranks.append(result)

    final = pd.concat(ranks).sort_index()

    return final


if __name__ == "__main__":

    from scipy.stats import rankdata, spearmanr
    from time import time

    np.random.seed(0)

    size = int(1e5), 5
    d1 = np.random.randint(5, size=size)
    d2 = np.random.randint(5, size=size)

    start = time()
    actual = spearman_row(d1, d2)
    end = time()
    print("actual", actual)
    print("rowbased took", end - start)

    start = time()
    expected = []
    for i in range(len(d1)):
        expected.append(spearmanr(d1[i], d2[i]).correlation)
    end = time()
    print("scipy took", end - start)

    expected = np.array(expected)

    print("largest diff", pd.Series(expected - actual).abs().max())

It prints:

rowbased took 3.6308434009552
scipy took 53.552557945251465
largest diff 2.220446049250313e-16

score 0 · Answer 3 · answered Sep 17 '18 at 16:27

0

Pandas has a corr function with the support of spearman. It works on columns, so we can just transpose the dataFrame.

We will append df1 to df2 and calculate the correlation by iterating over each row

len_df1 = df1.shape[0]
df2_index = df2.index.values.tolist()


df = df2.append(df1).reset_index(drop=True).T
values = {i: [df.iloc[:,df2_index+[i]].corr(method='spearman').values] for i in range(len_df1)}

answered Sep 17 '18 at 16:27

Raunaq Jain

917
7
13

I think there's a problem on the use of `df2.index` since it is made of strings, while in the last two statements I think it is the rank of the columns (due to `reset_index()`). Thanks for the suggestion however, I am trying to build from here. – gc5 Sep 17 '18 at 16:36
I don't have an idea about how the dataFrame looks like but you `reset_index()` once before the `df2_index` assignment and that should work. – Raunaq Jain Sep 17 '18 at 16:51