
I have a Pandas DataFrame that is generated from performing multiple correlations across variables.

corr = df.apply(lambda s: df.corrwith(s))
print('\n', 'Correlations')
print(corr.to_string())

The output looks like this:

 Correlations
        A         B           C          D          E
A   1.000000   -0.901104    0.662530  -0.772657   0.532606
B  -0.901104    1.000000   -0.380257   0.946223  -0.830466
C   0.662530   -0.380257    1.000000  -0.227531  -0.102506
D  -0.772657    0.946223   -0.227531   1.000000  -0.888768
E   0.532606   -0.830466   -0.102506  -0.888768   1.000000

However, this is a small sample of the correlation table, which can be over 300 rows x 300 cols. I'm trying to find a way to identify the coordinates for correlations within a specific value range.

For example, correlations between +0.25 and -0.25. My desired output would be:

E x C = -0.102506
D x C = -0.227531

In searching, I've found a few pandas functions that I'm unable to put together in a coherent way: `iloc`, `loc`, and `Series.between`.

How would you suggest I go about accomplishing this filtering?

pepe
  • I'm unsure of what is missing from the results you found from your research. It makes me think there's more to the problem than you're stating – roganjosh Oct 26 '18 at 18:43
  • For example, it would be helpful to know whether the pandas functions above are adequate for this purpose, or to get pointers on how to use them to solve this problem; I haven't found any so far. – pepe Oct 26 '18 at 18:46

1 Answer


Use a boolean mask with `DataFrame.where`. We'll use `np.triu` to get rid of the duplicates, since the correlation matrix is symmetric.

import numpy as np

corr.where(np.triu((corr.values <= 0.25) & (corr.values >= -0.25))).stack()

C  D   -0.227531
   E   -0.102506
dtype: float64
ALollz
  • @pepe Right, sadly there is no `DataFrame.between` method, so you're stuck doing both comparisons. I guess you could use `np.logical_and`, but for 2 conditions it's not much cleaner. We get an array of True/False values with the same shape as the `DataFrame`, and then `DataFrame.where` leaves the `True` cells alone and fills the `False` cells with `NaN`. `.stack` then keeps only the non-null cells once it's done. – ALollz Oct 26 '18 at 19:00
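Putting the pieces from the comment above together, here is a minimal end-to-end sketch. The data is randomly generated for illustration (not the asker's), and `.dropna()` is chained after `.stack()` explicitly, since newer pandas versions may keep `NaN` rows by default:

```python
import numpy as np
import pandas as pd

# Small reproducible frame (illustrative data only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("ABCDE"))

corr = df.corr()  # same result as df.apply(lambda s: df.corrwith(s))

# Boolean mask: True where the correlation is in [-0.25, 0.25].
in_range = (corr.values >= -0.25) & (corr.values <= 0.25)

# np.triu zeroes out the lower triangle, dropping mirrored duplicates;
# the 1.0 diagonal is already excluded by the range mask.
pairs = corr.where(np.triu(in_range)).stack().dropna()
print(pairs)
```

Each entry of `pairs` is indexed by the (row, column) label pair, so it maps directly onto the "E x C = -0.102506" style output the question asks for.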