7

I have large data frame, and I need to calculate efficiently correlation between the data frame rows and given value list. for example:

dfa= DataFrame(np.zeros((1,4)) ,columns=['a','b','c','d'])
dfa.ix[0] = [2,6,8,12]
a   b   c   d
2.0 6.0 8.0 12.0
dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])
    a   b   c   d
0   2   6   8   12
1   1   3   4   6
2  -1  -3  -4  -6

I expect to get:

0    1
1    0.5
2   -0.5

I tried many version, for example:

dfb.T.corrwith(dfa.T, axis=0)

But ll I get is a lot of Nan's

Divakar
  • 218,885
  • 19
  • 262
  • 358
Naomi Fridman
  • 2,095
  • 2
  • 25
  • 36

3 Answers3

6

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

0    1.0
1    1.0
2   -1.0
dtype: float64
seralouk
  • 30,938
  • 9
  • 118
  • 133
  • 1
    No need to use for loop, you can `dataframe.corrwith(series)`. I.e.: `dfb.corrwith(dfa.iloc[0], axis=1)` – FcoRodr Nov 02 '17 at 12:28
  • 1
    remove second solution, first is best – jezrael Nov 02 '17 at 12:30
  • of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution. – Naomi Fridman Nov 02 '17 at 12:47
3

I think the number that you are trying to get is not correlation coefficient actually. The correlation between 1st and second row is 1 not 0.5. Correlation is a measure of linear relationship between variables. Here the two lists are strongly correlated with pearson's coefficient 1. If you plot row0 [2,6,8,12] against row1 [1,3,4,6] they all lie on a single line. Mean while if you want to find correlation between rows this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

Yogesh
  • 1,384
  • 1
  • 12
  • 16
  • Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations. – Naomi Fridman Nov 02 '17 at 12:44
2

Here's one using the correlation defintion with NumPy tools meant for performance with corr2_coeff_rowwise -

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))

Sample run -

In [74]: dfa
Out[74]: 
     a    b    c     d
0  2.0  6.0  8.0  12.0

In [75]: dfb
Out[75]: 
   a  b  c   d
0  2  6  8  12
1  1  3  4   6
2 -1 -3 -4  -6

In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
Out[76]: 
0    1.0
1    1.0
2   -1.0
dtype: float64

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4)))

In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4)))

# @sera's soln
In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
1 loop, best of 3: 4.09 s per loop

In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 1.53 ms per loop

Case #2 : Decent number of rows in dfb and 400 columns -

In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400)))

In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400)))

In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
10 loops, best of 3: 44.8 ms per loop

In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 635 µs per loop
Divakar
  • 218,885
  • 19
  • 262
  • 358