9

I have a set of columns (col1,col2,col3) in dataframe df1 I have another set of columns (col4,col5,col6) in dataframe df2 Assume this two dataframes has the same number of rows.

How do I generate a correlation table that do pairwise correlation between df1 and df2?

the table will look like

    col1 col2 col3
col4 ..   ..   ..
col5 ..   ..   ..
col6 ..   ..   ..

I use df1.corrwith(df2), it does not seem to generate the table as required.

I have seen the answer at How to check correlation between matching columns of two data sets?, but the main difference is that the col names does not matched.

piRSquared
  • 285,575
  • 57
  • 475
  • 624

1 Answers1

25

pandas quick and dirty

pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']

numpy clean

def corr(df1, df2):
    n = len(df1)
    v1, v2 = df1.values, df2.values
    sums = np.multiply.outer(v2.sum(0), v1.sum(0))
    stds = np.multiply.outer(v2.std(0), v1.std(0))
    return pd.DataFrame((v2.T.dot(v1) - sums / n) / stds / n,
                        df2.columns, df1.columns)

corr(df1, df2)

example

df1 = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))

df2 = pd.DataFrame(np.random.rand(10, 3), columns=list('xyz'))

pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']

          a         b         c         d
x  0.235624  0.844665 -0.647962  0.535562
y  0.357994  0.462007  0.205863  0.424568
z  0.688853  0.350318  0.132357  0.687038

corr(df1, df2)

          a         b         c         d
x  0.235624  0.844665 -0.647962  0.535562
y  0.357994  0.462007  0.205863  0.424568
z  0.688853  0.350318  0.132357  0.687038
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks, but I am comparing categorical fields instead, I decided to ask it as a new question, could you help take a look here, thanks in advance: http://stackoverflow.com/questions/41827716/how-to-perform-correlation-between-categorical-columns –  Jan 24 '17 at 12:00
  • @piRSquared Why do you say that the pandas solution is 'dirty'? I think it is a great one-liner that should be part of pandas package – tashuhka Dec 08 '17 at 09:50
  • @tashuhka you’re right. More quick and less dirty (-: – piRSquared Dec 08 '17 at 14:28
  • I tried to use this but I get KeyError: ('df1', 'df2') – Amine Kaddioui Aug 12 '18 at 18:32
  • 1
    I tried to use this but I get different correlation values from the actual/manual calculation in excel. – user2543 Sep 10 '21 at 07:59
  • I tried to use this but I get different correlation values from the actual/manual calculation in excel. I inserted ```df2.reset_index(drop=True, inplace=True)''' – user2543 Sep 10 '21 at 08:06