
How can I get output that lists only the pairs of variables whose correlation has an absolute value greater than 0.7?

I would like output similar to this:

four: one, three
one: three

Thanks for your time!

Code

import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)
print(y.corr())

Output

           four       one     three       two
four   1.000000 -0.989949 -0.880830 -0.670820
one   -0.989949  1.000000  0.913500  0.632456
three -0.880830  0.913500  1.000000  0.262613
two   -0.670820  0.632456  0.262613  1.000000
Psidom
Daniel

2 Answers


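If all you want is to print it out, this will work:

corr_mask = y.corr().abs() > 0.7  # compute the correlation matrix only once
col_names = corr_mask.columns.values

for col, row in corr_mask.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    print(col, col_names[row.values])

Note that every variable is also listed against itself, because the diagonal of the correlation matrix is 1. This works, but it may be slow on wide frames: `items` (called `iteritems` before pandas 2.0) loops in Python and yields each column as a Series.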
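
To get output shaped like the one requested in the question (each pair listed once, with no self-pairs), one option is to mask everything except the strict upper triangle of the correlation matrix before printing. The following is a minimal sketch, not part of the original answer; the names `upper`, `strong`, and `result` are illustrative, and the labels are sorted alphabetically on the assumption that the question's pairing order is wanted.

```python
import numpy as np
import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)

# Sort row and column labels alphabetically (assumption: matches the question's ordering).
corr = y.corr().abs().sort_index(axis=0).sort_index(axis=1)

# Strict upper triangle (k=1): drops the diagonal and the mirrored lower half.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strong = (corr > 0.7) & upper

result = {}
for name, row in strong.iterrows():
    partners = list(row.index[row.to_numpy()])
    if partners:
        result[name] = partners
        print(f"{name}: {', '.join(partners)}")
```

Masking by position rather than by value avoids having to compare correlations against 1 to filter out the diagonal.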

juanpa.arrivillaga
  • Thank you juanpa.arrivillaga. – Daniel Jun 09 '16 at 03:03
  • I tried to edit your code to exclude r=1, but I got an error message. How would you edit this code? `for col, row in (y.corr().abs() > 0.7).iteritems():` This didn't work for me: `for col, row in (y.corr().abs() > 0.7 and y.corr().abs()<1).iteritems():` – Daniel Jun 09 '16 at 03:11
  • Ah. So, if you want to do *element-wise* logical operations with pandas objects you need to use `&` for `and` and `|` for `or`. So this should work: `(y.corr().abs() > 0.7) & (y.corr().abs() < 1)`. See this answer: http://stackoverflow.com/questions/21415661/logic-operator-for-boolean-indexing-in-pandas – juanpa.arrivillaga Jun 09 '16 at 03:24
  • Thanks again! You solved my problem and I also learned from what you did. I'll be sure to pay it forward. – Daniel Jun 09 '16 at 03:33
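
The difference between `and` and `&` discussed in the comments above can be reproduced with a short, self-contained sketch. The variable names (`abs_corr`, `and_failed`, `mask`) are illustrative only and do not come from the thread.

```python
import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)
abs_corr = y.corr().abs()

# Python's `and` asks for the truth value of a whole DataFrame, which is ambiguous.
and_failed = False
try:
    (abs_corr > 0.7) and (abs_corr < 1)
except ValueError as err:
    and_failed = True
    print("`and` raised:", err)

# `&` combines the two boolean frames element by element; the parentheses are
# required because `&` binds more tightly than the comparison operators.
mask = (abs_corr > 0.7) & (abs_corr < 1)
print(mask)
```

Filtering with `< 1` depends on the diagonal being computed as exactly 1; comparing row and column labels instead is a sturdier way to drop self-pairs.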

This works for me:

corr = y.corr().unstack().reset_index() #group together pairwise
corr.columns = ['var1','var2','corr'] #rename columns to something readable
print( corr[ corr['corr'].abs() > 0.7 ] ) #keep correlation results above 0.7

You could further exclude each variable's pairing with itself (corr = 1) by changing the last line to:

print( corr[ (corr['corr'].abs() > 0.7) & (corr['var1'] != corr['var2']) ] )
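
The filter above still reports each pair twice, once as (var1, var2) and once as (var2, var1). As a hedged sketch that extends this answer rather than restating it, comparing the two name columns with `<` keeps a single ordering of each pair and drops self-pairs in one step; a `groupby` then collects the partners per variable. The names `pairs`, `strong`, and `grouped` are illustrative.

```python
import pandas as pd

x = {'one': [1, 2, 3, 4], 'two': [3, 5, 7, 5], 'three': [2, 3, 4, 9], 'four': [4, 3, 1, 0]}
y = pd.DataFrame(x)

# Long format: one row per (var1, var2) combination.
pairs = y.corr().unstack().reset_index()
pairs.columns = ['var1', 'var2', 'corr']

# var1 < var2 (string comparison) keeps one ordering per pair and removes self-pairs.
strong = pairs[(pairs['corr'].abs() > 0.7) & (pairs['var1'] < pairs['var2'])]

# Collect the partners of each variable; groupby sorts the keys alphabetically.
grouped = strong.groupby('var1')['var2'].apply(list)
for var1, partners in grouped.items():
    print(f"{var1}: {', '.join(partners)}")
```

With the sample data this groups the three strong pairs under `four` and `one`, in the same shape as the output asked for in the question.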
scottlittle