How to plot a heatmap for a high-dimensional dataset?

Question

I would greatly appreciate it if you could let me know how to plot a high-resolution heatmap for a large dataset with approximately 150 features.

My code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

XX = pd.read_csv('Financial Distress.csv')

y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
XX = XX.iloc[:, 3:87]
df=XX
df["target_var"]=y.tolist()
target_var=["target_var"]

fig, ax = plt.subplots(figsize=(8, 6))
correlation = df.select_dtypes(include=['float64',
                                             'int64']).iloc[:, 1:].corr()
sns.heatmap(correlation, ax=ax, vmax=1, square=True)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Correlation matrix')
plt.tight_layout()
plt.show()
k = df.shape[1]  # number of variables for heatmap
fig, ax = plt.subplots(figsize=(9, 9))
corrmat = df.corr()
# Generate a mask for the upper triangle
# np.bool was removed in NumPy 1.24; the builtin bool is the replacement
mask = np.zeros_like(corrmat, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cols = corrmat.nlargest(k, target_var)[target_var].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.0)
hm = sns.heatmap(cm, mask=mask, cbar=True, annot=True,
                 square=True, fmt='.2f', annot_kws={'size': 7},
                 yticklabels=cols.values,
                 xticklabels=cols.values)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Annotated heatmap matrix')
plt.tight_layout()
plt.show()

It works fine but the plotted heatmap for a dataset with more than 40 features is too small.

Thanks in advance,

[Increase figsize or dpi?](https://stackoverflow.com/a/638443/8881141) — Mr. T, Jun 23 '18 at 10:44
@Mr.T Thanks a lot for your time and consideration. I tried 'figsize=(200, 200),dpi=150' but I don't know why it doesn't improve much. Part of my dataset is here: https://www.kaggle.com/shebrahimi/financial-distress — ebrahimi, Jun 23 '18 at 12:39
@Mr.T Besides, I don't know how to work with plotly, if it is any useful. https://plot.ly/ — ebrahimi, Jun 23 '18 at 12:49
Did you try saving the figure, or just using plt.show()? What happens when you save as pdf? — Mark Teese, Jun 25 '18 at 06:48
Thanks for your time and consideration. I already tried fig.savefig('heat.png') and fig.savefig('heat.pdf') but it doesn't make any difference. @MarkTeese — ebrahimi, Jun 25 '18 at 08:20

Mark Teese · Answer 1 · 2018-07-27T08:35:32.220

Adjusting the figsize and dpi worked for me.

I adapted your code and doubled the size of the heatmap to 165 x 165. The rendering takes a while, but the png looks fine. My backend is "module://ipykernel.pylab.backend_inline."

As noted in my original answer, I'm pretty sure you forgot to close the figure object before creating a new one. Try plt.close("all") before fig, ax = plt.subplots() if you get weird effects.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

print(plt.get_backend())

# close any existing plots
plt.close("all")

df = pd.read_csv("Financial Distress.csv")
# select out the desired columns
df = df.iloc[:, 3:].select_dtypes(include=['float64','int64'])

# copy columns to double size of dataframe
df2 = df.copy()
df2.columns = "c_" + df2.columns
df3 = pd.concat([df, df2], axis=1)

# get the correlation coefficient between the different columns
corr = df3.iloc[:, 1:].corr()
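# DataFrame.as_matrix() was removed in pandas 1.0; copy so the masking below does not touch corr
arr_corr = corr.to_numpy(copy=True)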
# mask out the top triangle
arr_corr[np.triu_indices_from(arr_corr)] = np.nan

fig, ax = plt.subplots(figsize=(24, 18))

hm = sns.heatmap(arr_corr, cbar=True, vmin=-0.5, vmax=0.5,
                 fmt='.2f', annot_kws={'size': 3}, annot=True, 
                 square=True, cmap=plt.cm.Blues)

ticks = np.arange(corr.shape[0]) + 0.5
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90, fontsize=8)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.index, rotation=360, fontsize=8)

ax.set_title('correlation matrix')
plt.tight_layout()
plt.savefig("corr_matrix_incl_anno_double.png", dpi=300)

[Figures: the full heatmap, and a zoom of its top-left section]

I am so grateful for your time and consideration. Sorry, but as I mentioned this dataset is approximately half of my data. Really, I just added some new features to it (approximately 50 more features I created for it), so still the heatmap is small. @MarkTeese Thanks a lot. — ebrahimi, Jun 29 '18 at 05:21
I've updated my answer to include a huge (165x165) heatmap from the supplied data. I would suggest running my code using the available csv from kaggle, and playing around with figsize and text sizes a bit more. — Mark Teese, Jun 29 '18 at 08:06
Could you please let me know if it is possible to plot a big picture like this as 12 separate pictures (each separate picture would show some part of the big picture)? Something like [this](https://stackoverflow.com/questions/4534480/get-legend-as-a-separate-picture-in-matplotlib). Thanks. — ebrahimi, Sep 19 '18 at 18:39
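
On the question of splitting one big heatmap into 12 separate pictures: one possible approach is to cut the correlation matrix into a 3 x 4 grid of sub-matrices and draw each one on its own figure. The following is only a sketch under assumptions: the data is synthetic (60 random features standing in for the CSV), and `iter_tiles`, `n_row_blocks` and `n_col_blocks` are hypothetical names, not part of seaborn or matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def iter_tiles(corr, n_row_blocks=3, n_col_blocks=4):
    """Yield (row_block, col_block, sub-matrix) tiles that together cover `corr`."""
    row_chunks = np.array_split(np.arange(corr.shape[0]), n_row_blocks)
    col_chunks = np.array_split(np.arange(corr.shape[1]), n_col_blocks)
    for i, rows in enumerate(row_chunks):
        for j, cols in enumerate(col_chunks):
            yield i, j, corr.iloc[rows, cols]


# Synthetic stand-in for the real data (assumption: 60 numeric features).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 60)),
                  columns=[f"x{i}" for i in range(60)])
corr = df.corr()

tiles = list(iter_tiles(corr))
for i, j, tile in tiles:
    fig, ax = plt.subplots(figsize=(8, 8))
    # Fixed vmin/vmax keep the colour scale identical across all tiles.
    sns.heatmap(tile, ax=ax, vmin=-1, vmax=1, cmap="coolwarm",
                annot=True, fmt=".2f", annot_kws={"size": 6},
                xticklabels=True, yticklabels=True)
    ax.set_title(f"Correlation matrix, tile ({i}, {j})")
    fig.tight_layout()
    # fig.savefig(f"corr_tile_{i}_{j}.png", dpi=200)  # uncomment to write each tile
    plt.close(fig)  # close each figure so they do not accumulate in memory
```

Each tile keeps its own row and column labels, so the pictures can be read independently. Pinning `vmin` and `vmax` matters here: without it every tile would get its own colour scale and the pictures would not be comparable.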

score 1 · Answer 2 · answered Jun 24 '18 at 18:42

If I understand your problem correctly, I think all you have to do is increase your figure size:

f, ax = plt.subplots(figsize=(20, 20))

instead of

f, ax = plt.subplots(figsize=(9, 9))
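
A fixed size such as (20, 20) may still be too small at roughly 150 features, so one option is to let the figure grow with the number of features. This is a minimal sketch, assuming synthetic data in place of the CSV; `heatmap_figsize`, `inches_per_cell` and `min_side` are hypothetical names introduced for illustration, and 0.3 inches per cell is an arbitrary starting value to tune.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def heatmap_figsize(n_features, inches_per_cell=0.3, min_side=6.0):
    """Return a square (width, height) in inches that grows with the matrix size."""
    side = max(min_side, n_features * inches_per_cell)
    return (side, side)


# Synthetic stand-in for the real data (assumption: 150 numeric features).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 150)),
                  columns=[f"x{i}" for i in range(150)])
corr = df.corr()

figsize = heatmap_figsize(corr.shape[0])
fig, ax = plt.subplots(figsize=figsize)
# xticklabels/yticklabels=True asks seaborn to draw every label instead of thinning them.
sns.heatmap(corr, ax=ax, square=True, cmap="Blues",
            xticklabels=True, yticklabels=True)
ax.tick_params(labelsize=6)
fig.tight_layout()
# fig.savefig("corr_matrix.png", dpi=150)  # uncomment to write the file
```

An inline notebook backend scales large figures down to fit the page, so the saved file is usually the better place to judge whether the cells are readable.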

wordsforthewise

Thank you very much. I tried to increase figsize and dpi but I couldn't manage to resolve the issue. @wordsforthewise – ebrahimi Jun 24 '18 at 18:50
Then I guess the issue isn't clear. What do you mean by the 'heatmap is too small'? – wordsforthewise Jun 24 '18 at 20:09
Sorry, I inserted the plotted heatmap to show what I mean by too small. Thanks. @wordsforthewise – ebrahimi Jun 25 '18 at 00:32
Besides, if it is not possible to plot a heatmap for about 150 features, could you please let me know if it is possible to plot it just for the most correlated features? Is it a well-accepted procedure in the community to do so? What do they do in the case of high-dimensional data? – ebrahimi Jun 25 '18 at 00:44
Yes, you could filter correlations with `df.corr() > 0.5` or something similar. I'm not sure of any 'best practices', but I'd probably only look at the top most-correlated features in your case. It depends on what you want to do. – wordsforthewise Jun 25 '18 at 15:40
Thanks a lot for your time and consideration. I want to do classification using logistic regression. I decided to use L1 regularization in order to select the most relevant features, so I need to plot the heatmap as exploratory data analysis. Since there are a lot of features, it is also not possible to plot a pairplot. @wordsforthewise – ebrahimi Jun 25 '18 at 17:09
Sorry, I used abs(df.corr()) > 0.5 but it does not have any effect. @wordsforthewise – ebrahimi Jun 25 '18 at 17:34
I guess you want to see which features most correlate to the target, in that case, I'd split up the features into multiple groups of something like 20-30 and plot correlation maps of them with the target. Or use df.corr() to find features that correlate most highly with the target. – wordsforthewise Jun 25 '18 at 18:07
Could you please let me know if it is possible to split up the features into two parts and then plot three heatmaps, e.g., corr-upper-triangle = df.corr().iloc[1:50,1:50], corr-right-triangle = df.corr().iloc[51:100,51:100], and corr-middle-square = df.corr().iloc[51:100,1:50]? @wordsforthewise Thanks a lot – ebrahimi Jun 30 '18 at 14:37
Sure, you could do that, but you'll be missing correlations between lots of the variables. This is a case where it's probably too high of a dimension for plotting and I would use the raw numbers instead. If your goal is to find high correlations, then just filter the entries by which ones have high correlations. – wordsforthewise Jul 01 '18 at 23:27
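
On using the raw numbers: an expression like `abs(df.corr()) > 0.5` on its own only returns a boolean DataFrame of the same shape, so it has to be used as a filter to have any visible effect. The following is a rough sketch of two ways to do that, assuming synthetic data and a target column named `target_var` as in the question; `top_correlated_pairs` and `top_target_correlations` are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, threshold=0.5):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    corr = df.corr()
    # Strict upper triangle: each pair appears once and the diagonal is skipped.
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    pairs = pd.DataFrame({
        "feature_a": corr.index[rows],
        "feature_b": corr.columns[cols],
        "corr": corr.to_numpy()[rows, cols],
    })
    pairs = pairs[pairs["corr"].abs() >= threshold]
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


def top_target_correlations(df, target, k=20):
    """Return the k features whose |correlation| with `target` is largest."""
    target_corr = df.drop(columns=target).corrwith(df[target])
    order = target_corr.abs().sort_values(ascending=False).index
    return target_corr.loc[order].head(k)


# Synthetic stand-in for the real data (assumption, not the Kaggle CSV).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "x0": base,
    "x1": base + 0.1 * rng.normal(size=500),            # strongly correlated with x0
    "x2": rng.normal(size=500),                          # unrelated noise
    "target_var": -base + 0.5 * rng.normal(size=500),   # negatively correlated with x0
})

pairs = top_correlated_pairs(df, threshold=0.5)
top = top_target_correlations(df, "target_var", k=2)
print(pairs)
print(top)
```

If a plot is still wanted, the selected names can be fed back into a much smaller heatmap, for example `df[list(top.index) + ["target_var"]].corr()`. Sorting by absolute value keeps strong negative correlations, which a plain `> 0.5` filter would drop.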
