How to check correlation between matching columns of two data sets?

Question

If we have the data set:

import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?

The aim is to see correlations between the matching columns of the two datasets like this:

Are you aiming to get a single coefficient or 5 different coefficients? — ayhan, Dec 06 '16 at 21:13
I realize now my drawn picture was misleading. I'm looking to get a single coefficient between each of the matching columns of the different data sets — ishido, Dec 06 '16 at 21:17

score 3 · Answer 1 · edited May 23 '17 at 12:32

3

If you won't mind a NumPy based vectorized solution, based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays -

corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.

Sample run -

In [621]: a
Out[621]: 
    A   B   C   D   E
0  34  54  56   0  78
1  12  87  78  23  12
2  78  35   0  72  31
3  84  25  14  56   0
4  26  82  13  14  34

In [622]: b
Out[622]: 
    A   B   C    D   E
0  45  45  98    0  24
1  24  87  52   23  12
2  65  65  32    1  65
3  65  52  32  365   3
4  65  12  12   53  65

In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]: 
array([[ 0.71318502, -0.5923714 , -0.9704441 ,  0.48775228, -0.07401011],
       [ 0.0306753 , -0.0705457 ,  0.48801177,  0.34685977, -0.33942737],
       [-0.26626431, -0.01983468,  0.66110713, -0.50872017,  0.68350413],
       [ 0.58095645, -0.55231196, -0.32053858,  0.38416478, -0.62403866],
       [ 0.01652716,  0.14000468, -0.58238879,  0.12936016,  0.28602349]])

edited May 23 '17 at 12:32

Community

1
1

answered Dec 06 '16 at 21:47

Divakar

218,885
19
262
358

I'm actually considering changing it all to numpy. Next I want to actually do the correlations between 3 data sets where each column name has all three values on each axis. I think numpy will make that easier. Like this: http://seaborn.pydata.org/examples/network_correlations.html – ishido Dec 06 '16 at 21:50
Hi again, I've been using this solution a lot now, thank you. Have you done anything like this using Spearman's rank correlation instead of Pearson's r? – ishido Dec 09 '16 at 08:18
@ishido To my knowledge I haven't, sorry. – Divakar Dec 09 '16 at 11:13

blacksite · Accepted Answer · 2016-12-08T14:46:52.020

This achieves exactly what you want:

from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns = a.columns, index = a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (although a more robust way would be to iterate
# through the intersection of the two sets of columns, in the case your actual dataframes' columns don't match up
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col]) # correlation of those two Series
    correl = correl_signif[0] # grab the actual Pearson R value from the tuple from above
    c.loc[col, col] = correl   # locate the diagonal for that column and assign the correlation coefficient

Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:

c = pd.DataFrame(columns = a.columns, index = a.columns)

for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl

c is now this:

Out[16]: 
           A          B         C         D          E
A   0.713185  -0.592371 -0.970444  0.487752 -0.0740101
B  0.0306753 -0.0705457  0.488012   0.34686  -0.339427
C  -0.266264 -0.0198347  0.661107  -0.50872   0.683504
D   0.580956  -0.552312 -0.320539  0.384165  -0.624039
E  0.0165272   0.140005 -0.582389   0.12936   0.286023

yes! Sorry, I edited the image I posted. Is it possible to do this including all the correlation coefficients? THe matrix you get is exactly the kind of thing I'm looking for. — ishido, Dec 06 '16 at 21:24

score 2 · Answer 3 · answered Dec 07 '16 at 00:23

I use this function that breaks it down with numpy

def corr_ab(a, b):

    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)

    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))

    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)

demo

corr_ab(a, b)

score 0 · Answer 4 · answered Dec 06 '16 at 21:21

Do you have to use Pandas? This seem can be done via numpy rather easily. Did i understand the task incorrectly?

   import numpy
   X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
   Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
   for key,value in X.items():
        print "correlation stats for %s is %s" % (key, numpy.corrcoef(value,Y[key]))

How to check correlation between matching columns of two data sets?

4 Answers4

Linked