I would greatly appreciate it if you could let me know how to plot a high-resolution heatmap for a large dataset with approximately 150 features.

My code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

XX = pd.read_csv('Financial Distress.csv')

y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
XX = XX.iloc[:, 3:87].copy()  # copy, so adding a column does not act on a slice
df = XX
df["target_var"] = y.tolist()
target_var = ["target_var"]

fig, ax = plt.subplots(figsize=(8, 6))
correlation = df.select_dtypes(include=['float64',
                                             'int64']).iloc[:, 1:].corr()
sns.heatmap(correlation, ax=ax, vmax=1, square=True)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Correlation matrix')
plt.tight_layout()
plt.show()
k = df.shape[1]  # number of variables for heatmap
fig, ax = plt.subplots(figsize=(9, 9))
corrmat = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corrmat, dtype=bool)  # builtin bool; the np.bool alias is missing from some NumPy releases
mask[np.triu_indices_from(mask)] = True
cols = corrmat.nlargest(k, target_var)[target_var].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.0)
hm = sns.heatmap(cm, mask=mask, cbar=True, annot=True,
                 square=True, fmt='.2f', annot_kws={'size': 7},
                 yticklabels=cols.values,
                 xticklabels=cols.values)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Annotated heatmap matrix')
plt.tight_layout()
plt.show()

It works fine, but the heatmap plotted for a dataset with more than 40 features is too small. (image: the plotted heatmap)

Thanks in advance,

ebrahimi
  • [Increase figsize or dpi?](https://stackoverflow.com/a/638443/8881141) – Mr. T Jun 23 '18 at 10:44
  • @Mr.T Thanks a lot for your time and consideration. I tried 'figsize=(200, 200), dpi=150', but I don't know why it doesn't improve things much. Part of my dataset is here: https://www.kaggle.com/shebrahimi/financial-distress – ebrahimi Jun 23 '18 at 12:39
  • @Mr.T Besides, I don't know how to work with plotly, if it is any useful. https://plot.ly/ – ebrahimi Jun 23 '18 at 12:49
  • Did you try saving the figure, or just using plt.show()? What happens when you save as pdf? – Mark Teese Jun 25 '18 at 06:48
  • Thanks for your time and consideration. I already tried fig.savefig('heat.png') and fig.savefig('heat.pdf'), but it makes no difference. @MarkTeese – ebrahimi Jun 25 '18 at 08:20

2 Answers

Adjusting the figsize and dpi worked for me.

I adapted your code and doubled the size of the heatmap to 165 x 165 features. The rendering takes a while, but the png looks fine. My backend is "module://ipykernel.pylab.backend_inline".

As noted in my original answer, I'm pretty sure you forgot to close the figure object before creating a new one. Try plt.close("all") before fig, ax = plt.subplots() if you get weird effects.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

print(plt.get_backend())

# close any existing plots
plt.close("all")

df = pd.read_csv("Financial Distress.csv")
# select out the desired columns
df = df.iloc[:, 3:].select_dtypes(include=['float64','int64'])

# copy columns to double size of dataframe
df2 = df.copy()
df2.columns = "c_" + df2.columns
df3 = pd.concat([df, df2], axis=1)

# get the correlation coefficient between the different columns
corr = df3.iloc[:, 1:].corr()
arr_corr = corr.to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0
# mask out the top triangle
arr_corr[np.triu_indices_from(arr_corr)] = np.nan

fig, ax = plt.subplots(figsize=(24, 18))

hm = sns.heatmap(arr_corr, cbar=True, vmin=-0.5, vmax=0.5,
                 fmt='.2f', annot_kws={'size': 3}, annot=True, 
                 square=True, cmap=plt.cm.Blues)

ticks = np.arange(corr.shape[0]) + 0.5
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90, fontsize=8)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.index, rotation=360, fontsize=8)

ax.set_title('correlation matrix')
plt.tight_layout()
plt.savefig("corr_matrix_incl_anno_double.png", dpi=300)

Full figure: (image: corr_matrix_anno_double)

Zoom of the top-left section: (image: zoom_of_top_end)

Mark Teese
  • I am so grateful for your time and consideration. Sorry, but as I mentioned this dataset is approximately half of my data. Really, I just added some new features to it (approximately 50 more features I created for it), so still the heatmap is small. @MarkTeese Thanks a lot. – ebrahimi Jun 29 '18 at 05:21
  • I've updated my answer to include a huge (165x165) heatmap from the supplied data. I would suggest running my code using the available csv from kaggle, and playing around with figsize and text sizes a bit more. – Mark Teese Jun 29 '18 at 08:06
  • Could you please let me know if it is possible to plot a big picture like this as 12 separate pictures (each separate picture would show some part of the big picture)? Something like [this](https://stackoverflow.com/questions/4534480/get-legend-as-a-separate-picture-in-matplotlib). Thanks. – ebrahimi Sep 19 '18 at 18:39
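On the question of splitting one big picture into separate ones: a minimal sketch, assuming it is acceptable to plot square blocks of the correlation matrix one at a time and save each block to its own file. The helper names `tile_slices` and `save_corr_tiles`, the file prefix, and the synthetic data are made up for illustration and are not part of the answer above. Only calls that seaborn and matplotlib are known to provide are used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def tile_slices(n, tile_size):
    """Return consecutive slices covering range(n) in chunks of tile_size."""
    return [slice(start, min(start + tile_size, n))
            for start in range(0, n, tile_size)]


def save_corr_tiles(corr, tile_size, prefix):
    """Save every tile_size x tile_size block of `corr` as its own PNG (hypothetical helper)."""
    blocks = tile_slices(corr.shape[0], tile_size)
    paths = []
    for i, rows in enumerate(blocks):
        for j, cols in enumerate(blocks):
            tile = corr.iloc[rows, cols]
            fig, ax = plt.subplots(figsize=(6, 6))
            # Fixed vmin/vmax keep the colour scale identical across all tiles
            sns.heatmap(tile, ax=ax, vmin=-1, vmax=1, cmap="coolwarm",
                        annot=True, fmt=".2f", annot_kws={"size": 6},
                        square=True)
            ax.set_title(f"Correlation tile ({i}, {j})")
            fig.tight_layout()
            path = f"{prefix}_{i}_{j}.png"
            fig.savefig(path, dpi=100)
            plt.close(fig)  # release the figure before drawing the next tile
            paths.append(path)
    return paths


# Synthetic stand-in for the real data: 200 rows, 12 features
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 12)),
                    columns=[f"x{k}" for k in range(12)])
tile_paths = save_corr_tiles(demo.corr(), tile_size=6, prefix="corr_tile")
```

With 12 features and `tile_size=6` this writes a 2 x 2 grid of files. For roughly 150 features, a tile size of about 40 would give a 4 x 4 grid; since the matrix is symmetric, the tiles above the diagonal duplicate those below it and could be skipped.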

If I understand your problem correctly, I think all you have to do is increase your figure size:

f, ax = plt.subplots(figsize=(20, 20))

instead of

f, ax = plt.subplots(figsize=(9, 9))
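
Rather than guessing a fixed size, the figure side could be derived from the number of features. This is only a sketch of that arithmetic: the helper `figsize_for` and its default of 0.25 inches per cell are assumptions chosen for illustration, not values from this answer, and the right cell size depends on the annotation font size.

```python
def figsize_for(n_features, inches_per_cell=0.25, min_size=6.0):
    """Return a square (width, height) in inches that grows with the feature count.

    inches_per_cell and min_size are illustrative defaults, not recommendations.
    """
    side = max(float(min_size), n_features * inches_per_cell)
    return (side, side)


# About 150 features at 0.25 inches per cell gives a 37.5 x 37.5 inch figure
print(figsize_for(150))
```

The result can be passed straight to the call above, for example `plt.subplots(figsize=figsize_for(df.shape[1]))`.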
wordsforthewise
  • Thank you very much. I tried to increase figsize and dpi, but I couldn't manage to resolve the issue. @wordsforthewise – ebrahimi Jun 24 '18 at 18:50
  • Then I guess the issue isn't clear. What do you mean by the 'heatmap is too small'? – wordsforthewise Jun 24 '18 at 20:09
  • Sorry, I inserted the plotted heatmap to show what I mean by too small. Thanks. @wordsforthewise – ebrahimi Jun 25 '18 at 00:32
  • Besides, if it is not possible to plot a heatmap for about 150 features, could you please let me know whether it is possible to plot it just for the most correlated features? Is that a well-accepted procedure in the community? What do people do with high-dimensional data? – ebrahimi Jun 25 '18 at 00:44
  • Yes, you could filter correlations with `df.corr() > 0.5` or something similar. I'm not sure of any 'best practices', but I'd probably only look at the top most-correlated features in your case. It depends on what you want to do. – wordsforthewise Jun 25 '18 at 15:40
  • Thanks a lot for your time and consideration. I want to do classification using logistic regression. I decided to use L1 regularization to select the most relevant features, so I need to plot the heatmap as exploratory data analysis. Since there are a lot of features, it is also not possible to plot a pairplot. @wordsforthewise – ebrahimi Jun 25 '18 at 17:09
  • Sorry, I used abs(df.corr()) > 0.5 but it does not have any effect. @wordsforthewise – ebrahimi Jun 25 '18 at 17:34
  • I guess you want to see which features most correlate to the target, in that case, I'd split up the features into multiple groups of something like 20-30 and plot correlation maps of them with the target. Or use df.corr() to find features that correlate most highly with the target. – wordsforthewise Jun 25 '18 at 18:07
  • Could you please let me know if it is possible to split up the features into two parts and then plot three heatmaps, e.g., corr-upper-triangle = df.corr().iloc[1:50,1:50], corr-right-triangle = df.corr().iloc[51:100,51:100], and corr-middle-square = df.corr().iloc[51:100,1:50]? @wordsforthewise Thanks a lot – ebrahimi Jun 30 '18 at 14:37
  • Sure, you could do that, but you'll be missing correlations between lots of the variables. This is a case where it's probably too high of a dimension for plotting and I would use the raw numbers instead. If your goal is to find high correlations, then just filter the entries by which ones have high correlations. – wordsforthewise Jul 01 '18 at 23:27
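
The "use the raw numbers" suggestion from the comments can be sketched as follows. Note that `abs(df.corr()) > 0.5` on its own only produces a table of True/False values; it has to be applied as a filter to have any effect. The function name `top_correlated_pairs`, the 0.5 threshold, and the synthetic columns are assumptions made for this example, not something taken from the thread.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, threshold=0.5):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    corr = df.corr()
    # Keep only the strict upper triangle: each pair appears once and the
    # diagonal (every feature correlated with itself) is left out.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # Apply the boolean comparison as a filter; NaN entries fail it and drop out.
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data: b tracks a closely, d mirrors a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
    "d": -a + 0.5 * rng.normal(size=500),
})
result = top_correlated_pairs(demo, threshold=0.5)
print(result)
```

The output is a short Series indexed by feature pair, which stays readable at 150 features where a heatmap does not. A heatmap restricted to the features that appear in `result` is then small enough to annotate.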