how to calculate correlation between rows in python pandas data frame

Question

I have large data frame, and I need to calculate efficiently correlation between the data frame rows and given value list. for example:

dfa= DataFrame(np.zeros((1,4)) ,columns=['a','b','c','d'])
dfa.ix[0] = [2,6,8,12]
a   b   c   d
2.0 6.0 8.0 12.0
dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])
    a   b   c   d
0   2   6   8   12
1   1   3   4   6
2  -1  -3  -4  -6

I expect to get:

0    1
1    0.5
2   -0.5

I tried many version, for example:

dfb.T.corrwith(dfa.T, axis=0)

But ll I get is a lot of Nan's

Can you please make clear the rows you are trying to find correlation between? — Yogesh, Nov 02 '17 at 12:12
Be sure to check out NumPy based one for performance - https://stackoverflow.com/a/47075690/. — Divakar, Nov 02 '17 at 12:52

seralouk · Accepted Answer · 2017-11-02T12:31:57.590

6

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

0    1.0
1    1.0
2   -1.0
dtype: float64

edited Nov 02 '17 at 12:31

answered Nov 02 '17 at 12:20

seralouk

30,938
9
118
133

1

No need to use for loop, you can `dataframe.corrwith(series)`. I.e.: `dfb.corrwith(dfa.iloc[0], axis=1)` – FcoRodr Nov 02 '17 at 12:28
1

remove second solution, first is best – jezrael Nov 02 '17 at 12:30
of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution. – Naomi Fridman Nov 02 '17 at 12:47

Yogesh · Answer 2 · 2018-12-05T05:05:06.557

3

I think the number that you are trying to get is not correlation coefficient actually. The correlation between 1st and second row is 1 not 0.5. Correlation is a measure of linear relationship between variables. Here the two lists are strongly correlated with pearson's coefficient 1. If you plot row0 [2,6,8,12] against row1 [1,3,4,6] they all lie on a single line. Mean while if you want to find correlation between rows this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

edited Dec 05 '18 at 05:05

answered Nov 02 '17 at 12:27

Yogesh

1,384
1
12
16

Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations. – Naomi Fridman Nov 02 '17 at 12:44

Divakar · Answer 3 · 2017-11-02T13:04:14.103

Here's one using the correlation defintion with NumPy tools meant for performance with corr2_coeff_rowwise -

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))

Sample run -

In [74]: dfa
Out[74]: 
     a    b    c     d
0  2.0  6.0  8.0  12.0

In [75]: dfb
Out[75]: 
   a  b  c   d
0  2  6  8  12
1  1  3  4   6
2 -1 -3 -4  -6

In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
Out[76]: 
0    1.0
1    1.0
2   -1.0
dtype: float64

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4)))

In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4)))

# @sera's soln
In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
1 loop, best of 3: 4.09 s per loop

In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 1.53 ms per loop

Case #2 : Decent number of rows in dfb and 400 columns -

In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400)))

In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400)))

In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
10 loops, best of 3: 44.8 ms per loop

In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 635 µs per loop

how to calculate correlation between rows in python pandas data frame

3 Answers3