
I've got too many features in a data frame. I'm trying to plot ONLY the features which are correlated over a certain threshold, let's say over 80%, and show those in a heatmap. I put some code together, and it runs, but I still see some white lines, which have no data, and thus no correlation. Also, I'm seeing things that are well under 80% correlation. Here is the code that I tried.

import matplotlib.pyplot as plt
import numpy as np
import seaborn
c = newdf.corr()
plt.figure(figsize=(10,10))
seaborn.heatmap(c, cmap='RdYlGn_r', mask = (np.abs(c) >= 0.8))
plt.show()

When I run that, I see this.

[screenshot of the resulting heatmap]

What is wrong here?
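A likely cause, sketched under assumptions: seaborn's `mask` argument leaves a cell blank wherever the mask is True, so `np.abs(c) >= 0.8` hides the strong correlations and shows everything below the threshold. The white lines are probably rows and columns where the correlation is NaN. The `toy` data frame below is made up for illustration; only the mask condition matters.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: b tracks a closely, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
c = toy.corr()

# seaborn.heatmap(..., mask=m) leaves a cell blank wherever m is True,
# so the mask must mark the cells to HIDE: those below the threshold.
hide = np.abs(c) < 0.8

# Preview what would remain visible (hidden cells become NaN).
print(c.mask(hide))

# To plot: seaborn.heatmap(c, cmap='RdYlGn_r', mask=hide)
```

Note that this only blanks weak cells; every feature still gets a row and a column, because each one correlates perfectly with itself.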

I am making a small update, with some new findings.

This gets ONLY corr>.8.

import seaborn as sns

corr = newdf.corr()
kot = corr[corr>=.8]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Reds")

[screenshot of the resulting heatmap]

That seems to work, but it still gives me a lot of white! I thought there should be a way to include only the features that have a correlation over a certain amount. Maybe you have to copy the features with a correlation >.8 to a new data frame and build the correlation off of that object. I'm not sure how this works.
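The idea in the last paragraph can be sketched roughly as follows: keep only the features that have at least one other feature with |correlation| >= 0.8, then plot the correlation of just those. The helper name `strongly_correlated_submatrix` and the `toy` data are hypothetical, not from the post, and the diagonal is ignored because every feature correlates perfectly with itself.

```python
import numpy as np
import pandas as pd


def strongly_correlated_submatrix(df, threshold=0.8):
    """Correlation matrix restricted to features that have at least one
    OTHER feature with |correlation| >= threshold (hypothetical helper)."""
    corr = df.select_dtypes("number").corr()
    strong = (corr.abs() >= threshold).to_numpy().copy()
    # Ignore the diagonal: every feature has correlation 1 with itself.
    np.fill_diagonal(strong, False)
    keep = strong.any(axis=0)
    return corr.loc[keep, keep]


# Made-up data: a, b and d move together; c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "d": -a + rng.normal(scale=0.1, size=200),
})

sub = strongly_correlated_submatrix(toy)
print(list(sub.columns))

# To plot: sns.heatmap(sub, cmap="Reds")
```

The reduced matrix can still contain weak cells, since two kept features need not be strongly correlated with each other; it only guarantees that each kept feature has at least one strong partner.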

ASH
  • The correlation of each column with itself is 1, so such filtering keeps all columns. Do you want to reorder the columns such that strongly correlated columns are close to each other? – Kate Melnykova Sep 23 '20 at 02:06
  • Yes, or at least I think so. Is there a way to copy all highly correlated features into a new data frame object, and then plot the heat map based on that object? – ASH Sep 23 '20 at 02:33
  • One way to achieve what you need is to rearrange the features, so the map above looks clustered. The example is here: https://stackoverflow.com/questions/2982929/plotting-results-of-hierarchical-clustering-ontop-of-a-matrix-of-data-in-python/3011894#3011894 – Kate Melnykova Sep 23 '20 at 02:43
  • Alternatively, if you want to split all your features into the groups, you may use the data structure: https://www.geeksforgeeks.org/union-find/ In your example, two nodes(features) are connected if their correlation is above 0.8 in magnitude. – Kate Melnykova Sep 23 '20 at 02:53
  • Thanks for both suggestions, but I don't see how either helps me. It's getting pretty late in NYC now. I'm tired; I'll revisit this tomorrow morning. As I think about it more, I think the best solution is to copy features with >.8 correlation into a new data frame, and do the plot based on that. I think that makes sense. I need to think on it a little. – ASH Sep 23 '20 at 02:59
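The union-find suggestion from the comments could look roughly like this. It is a minimal sketch, not the commenter's code: the function name `correlated_groups` and the small hand-written correlation matrix are assumptions made for illustration.

```python
import pandas as pd


def correlated_groups(corr, threshold=0.8):
    """Group features with union-find: two features are linked when
    |correlation| > threshold. NaN correlations never link."""
    parent = {name: name for name in corr.columns}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    cols = list(corr.columns)
    for i, u in enumerate(cols):
        for v in cols[i + 1:]:
            if abs(corr.loc[u, v]) > threshold:
                parent[find(u)] = find(v)  # union the two groups

    groups = {}
    for name in cols:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Hand-written, made-up correlation matrix for illustration.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.2, 0.85],
     [0.1, 0.2, 1.0, 0.3],
     [0.0, 0.85, 0.3, 1.0]],
    index=list("abcd"), columns=list("abcd"),
)
groups = correlated_groups(corr)
print(groups)  # [['a', 'b', 'd'], ['c']]
```

Grouping is transitive: `a` and `d` end up together through `b` even though their own correlation is 0.0, so a group's heatmap can still contain weak cells.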

1 Answer


The following code groups the strongly correlated features (those with a correlation above 0.8 in magnitude) into connected components and plots the correlation matrix of each component separately. Please let me know if it differs from what you want.

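import matplotlib.pyplot as plt
import seaborn as sns

# NaN correlations (e.g. from constant columns) are treated as "no correlation".
corr = newdf.corr().fillna(0)

components = []
visited = set()
for col in corr.columns:
    if col in visited:
        continue

    # Breadth-first search over features linked by |correlation| > 0.8.
    component = [col]
    just_visited = [col]
    visited.add(col)
    while just_visited:
        c = just_visited.pop(0)
        for idx, val in corr[c].items():
            if abs(val) > 0.8 and idx not in visited:
                just_visited.append(idx)
                visited.add(idx)
                component.append(idx)
    components.append(component)

# One heatmap per group of correlated features.
# component is a list because .loc does not accept a set as an indexer.
for component in components:
    plt.figure(figsize=(12, 8))
    sns.heatmap(corr.loc[component, component], cmap="Reds")
plt.show()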
Kate Melnykova
  • Thanks so much! When I run this I get the following error: "ValueError: zero-size array to reduction operation minimum which has no identity". When I look at the 'components' object, it seems to be a dict inside a list, and there are no values in there, so there is no way to get a correlation. Did it actually work for you? – ASH Sep 23 '20 at 13:57
  • Oh, I forgot to mention that I changed the correlation threshold to 0.5. No values are getting appended into the list. – ASH Sep 23 '20 at 14:18
  • Hmm... Component should be always non-empty. As an easy fix, try to convert set component into the list, i.e., `components.append(list(component))` – Kate Melnykova Sep 23 '20 at 17:03
  • I made that change and I'm still getting the same error as before. – ASH Sep 23 '20 at 17:19
  • Oh, I see what happened. I had some NAN results in the corr. I replaced NAN with 0, and it's running! Thanks!! – ASH Sep 23 '20 at 17:38
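NaN entries in a correlation matrix commonly come from columns with zero variance (constant columns), although whether that was the cause here is an assumption. The sketch below, on a made-up `toy` data frame, shows how to spot such columns and two ways to handle them.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.1, 5.9, 8.2],
    "flat": [7.0, 7.0, 7.0, 7.0],  # zero variance -> correlation undefined
})
corr = toy.corr()

# Features whose correlations are all NaN.
all_nan = corr.isna().all(axis=0)
print(list(corr.columns[all_nan]))

# Option 1 (what the thread settled on): treat NaN as "no correlation".
filled = corr.fillna(0)

# Option 2: drop those features so they leave no blank rows or columns.
cleaned = corr.loc[~all_nan, ~all_nan]
```

Dropping removes the blank stripes from a heatmap entirely, whereas filling with 0 keeps the feature and shows it as uncorrelated with everything, including itself.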