I have 100 variables and I am trying to plot them in a correlation matrix. However, as you can see in the picture, I have too many variables for a good visual presentation. Is there a graphical presentation that shows only the relevant correlated variables, for example those above a threshold of 0.5?

This is the code I used:

import numpy as np  # Data manipulation
import matplotlib.pyplot as plt
import seaborn

c = df_train_new.corr()
plt.figure(figsize=(20,20))
seaborn.heatmap(c, cmap='RdYlGn_r', mask = (np.abs(c) >= 0.5))
plt.show()

Correlation Matrix

petezurich

1 Answer

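Remove the diagonal duplication by masking the upper triangle of the correlation matrix. Then remove highly correlated pairs by dropping columns whose correlation with another column exceeds 0.95.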

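import seaborn as sns

corr = df_train_new.corr()
# True in the upper triangle (diagonal included) hides the duplicated half
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, cmap='RdYlGn_r', center=0, linewidths=1, annot=True, fmt=".2f", mask=mask)
plt.show()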
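
With 100 variables the triangular heatmap is still crowded, so one option is to first keep only the variables that have at least one strong correlation with another variable. The following is a minimal sketch, not a definitive solution: the helper names `strong_correlations` and `plot_strong_correlations` and the demo data are made up for illustration, and the 0.5 threshold is taken from the question.

```python
import numpy as np
import pandas as pd


def strong_correlations(df, threshold=0.5):
    """Correlation sub-matrix of the columns that have at least one
    off-diagonal |r| >= threshold (hypothetical helper)."""
    corr = df.corr()
    strong = corr.abs().to_numpy() >= threshold
    # Ignore the diagonal: every variable correlates perfectly with itself.
    np.fill_diagonal(strong, False)
    keep = strong.any(axis=0)
    return corr.loc[keep, keep]


def plot_strong_correlations(df, threshold=0.5):
    # Plotting libraries are imported here so the filter works without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sub = strong_correlations(df, threshold)
    # Hide the duplicated upper triangle and every cell below the threshold.
    weak = (sub.abs() < threshold).to_numpy()
    mask = np.triu(np.ones(sub.shape, dtype=bool)) | weak
    sns.heatmap(sub, cmap="RdYlGn_r", center=0, annot=True, fmt=".2f", mask=mask)
    plt.show()


# Synthetic demo data (an assumption, standing in for df_train_new).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # strongly correlated with "a"
    "c": rng.normal(size=200),            # independent noise
})
sub = strong_correlations(demo)
print(list(sub.columns))
```

Calling `plot_strong_correlations(df_train_new)` would then draw only the reduced matrix. If no pair reaches the threshold the sub-matrix is empty, so a real script should probably check for that before plotting.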

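Removing highly correlated features

Correlation coefficients range from -1 to 1, where 0 means no linear correlation. Drop features whose correlation with another feature is close to 1 or -1.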

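# Blank out the masked upper triangle and diagonal so each pair appears once
tri_df = corr.abs().mask(mask)

# Columns with any remaining |correlation| above 0.95
to_drop = [c for c in tri_df.columns if (tri_df[c] > 0.95).any()]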
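
The drop step can be put together end to end roughly as follows. This is a sketch under assumptions: `drop_highly_correlated` is a hypothetical helper name, the data is synthetic, and which column of a correlated pair gets dropped simply depends on column order.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop one column from each pair with |r| > threshold (hypothetical helper)."""
    corr = df.corr().abs()
    # Mask the upper triangle and diagonal so each pair is counted once.
    mask = np.triu(np.ones(corr.shape, dtype=bool))
    tri = corr.mask(mask)
    # A column is dropped if any later column correlates with it above the threshold.
    to_drop = [col for col in tri.columns if (tri[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic demo data: "x_scaled" is a linear function of "x".
rng = np.random.default_rng(1)
x = rng.normal(size=100)
demo = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + 1,        # perfectly correlated with "x"
    "y": rng.normal(size=100),    # independent noise
})
reduced, to_drop = drop_highly_correlated(demo)
print(to_drop)
print(list(reduced.columns))
```

Because only the lower triangle is inspected, the earlier column of each pair (`x` here) is the one removed. Reorder the columns first if a particular feature should be kept.
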
Golden Lion
  • Unfortunately, two of my variables are perfectly correlated with each other, so I need a way to exclude only the pairs where a variable is compared with itself (same name), not all the others. With the method above, I would exclude the two perfectly correlated variables and miss the fact that they are perfectly correlated. – Student Guess Jun 10 '22 at 05:26
  • You could create your own mask and include the two variables you need. Exclude the other columns that are highly correlated, since they add very little to the outcome of the prediction. – Golden Lion Jun 10 '22 at 14:13
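
One way to read the "own mask" suggestion from the comments is to mask only the self-pairs on the diagonal, so that two different variables with a correlation of 1 still show up. A minimal sketch, assuming synthetic data and a hypothetical helper name `correlated_pairs`, that lists the pairs instead of plotting them:

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.5):
    """List variable pairs with |r| >= threshold, excluding self-pairs
    (hypothetical helper)."""
    corr = df.corr()
    # Strictly lower triangle: drops the diagonal and the mirrored duplicates.
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack()
    # NaN cells from the mask compare as False and are filtered out here.
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic demo data: "x_copy" duplicates "x", "y" is independent noise.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
demo = pd.DataFrame({"x": x, "x_copy": x, "y": rng.normal(size=100)})
pairs = correlated_pairs(demo)
print(pairs)
```

The result is a Series indexed by (row variable, column variable), which stays readable even with 100 variables because only the pairs above the threshold are listed.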