3

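I have two DataFrames and I want to compute the correlation between every column of the first and every column of the second, without writing the loop myself. This is what I have now: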

import pandas as pd
df1 = pd.DataFrame({'A': range(0,4), 'B': range(14,10,-1)})
df2 = pd.DataFrame({'C': range(104,100,-1), 'D': range(2,6), 'E': range(11,7,-1)})
corr = pd.DataFrame(dict(c1=c1, **{c2:df2[c2].corr(df1[c1]) for c2 in df2.columns})
                    for c1 in df1.columns).set_index("c1")
corr.index.name = None

Now corr is

     C    D    E
A -1.0  1.0 -1.0
B  1.0 -1.0  1.0

Neither DataFrame.corr nor DataFrame.corrwith does what I need.

kaylum
sds
  • https://stackoverflow.com/questions/30143417/computing-the-correlation-coefficient-between-two-multi-dimensional-arrays – BENY Dec 18 '19 at 21:36
  • Wow, that link was news to me. I wonder why `pandas` favors the double loop in DataFrame.corr()? Is it because it's a bit more free to deal with different methods, or is it just a memory concern once you're in the world of 50 columns and 40M+ rows? – ALollz Dec 18 '19 at 21:51

2 Answers

3

You can use the methods apply and corrwith:

df2.apply(df1.corrwith)

Output:

     C    D    E
A -1.0  1.0 -1.0
B  1.0 -1.0  1.0
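
To see why this works, it may help to look at a single step: apply hands each column of df2 to df1.corrwith as a Series, and corrwith returns one correlation per column of df1. The sketch below reuses df1 and df2 from the question; the names col_c and result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': range(0, 4), 'B': range(14, 10, -1)})
df2 = pd.DataFrame({'C': range(104, 100, -1), 'D': range(2, 6), 'E': range(11, 7, -1)})

# One step of the apply: correlate every column of df1 with the single Series df2['C'].
# The result is a Series indexed by df1's columns (A, B).
col_c = df1.corrwith(df2['C'])

# apply repeats that step for C, D and E and stacks the Series as columns.
result = df2.apply(df1.corrwith)
print(result)
```

Each call to corrwith computes only the correlations against one column of df2, so the total work is one correlation per (df1 column, df2 column) pair.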
sds
Mykola Zotko
1

Concatenate the two DataFrames, compute the full correlation matrix, and select the df1-by-df2 block:

pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).corr().loc['df1', 'df2']

     C    D    E
A -1.0  1.0 -1.0
B  1.0 -1.0  1.0
d_kennetz
  • Prettier than the other answer, thanks, but still imperfect in that it computes (n+m)^2 correlations instead of n*m correlations. – sds Dec 18 '19 at 21:43
  • (the other answer I refer to has now been deleted) – sds Dec 19 '19 at 14:52
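
For the n*m concern raised in the comments, here is a minimal sketch in the spirit of the NumPy approach linked under the question: center the columns, then get all cross-products with one matrix multiplication. The helper name cross_corr is hypothetical, not a pandas function, and the sketch assumes Pearson correlation, row-aligned frames, no NaN values, and no constant columns.

```python
import numpy as np
import pandas as pd

def cross_corr(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column of `left` with every column of `right`.

    Hypothetical helper: assumes aligned rows, no NaN, no constant columns.
    """
    a = left.to_numpy(dtype=float)
    b = right.to_numpy(dtype=float)
    # Center each column so that dot products become covariance numerators.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # (n x m) matrix of cross-products, computed in a single matrix multiply.
    num = a.T @ b
    # Outer product of the column norms gives the matching denominators.
    den = np.sqrt((a ** 2).sum(axis=0))[:, None] * np.sqrt((b ** 2).sum(axis=0))[None, :]
    return pd.DataFrame(num / den, index=left.columns, columns=right.columns)

df1 = pd.DataFrame({'A': range(0, 4), 'B': range(14, 10, -1)})
df2 = pd.DataFrame({'C': range(104, 100, -1), 'D': range(2, 6), 'E': range(11, 7, -1)})
corr = cross_corr(df1, df2)
print(corr)
```

Unlike DataFrame.corr, this does not handle missing values pairwise, so it is only a drop-in replacement when the data is complete.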