
I am:

Then I want to:

  • (B) do the same using a couple of DataFrames with my own data.

I supply the distance correlation matrix to sns.clustermap directly, as done in the documentation example, because I am interested in the structure in the heatmap, as opposed to using the distance correlation matrix to calculate the linkage, as done in this SO answer, for example. I create the distance correlation matrix with a modification of code from this excellent SO answer.

  • (A) No issues here

As I execute:

distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
dcor_df= df.apply(lambda col1: df.apply(lambda col2: distcorr(col1, col2)))
sns.clustermap(dcor_df, cmap="mako",
               row_colors=network_colors, col_colors=network_colors,
               linewidths=.75, figsize=(13, 13))

I get the result I expected:

[image: clustermap of the distance correlation matrix for the documentation data]

  • (B) I do encounter issues here, as I move to my own data

For some background: I have two DataFrames with variables labeled A, B, ..., P in both. The variables are identical (same measurement, same units), but the measurements were collected in two locations that are spatially separated, hence my goal was to run the analysis separately, to see if the variables correlate in a similar way (i.e. with similar structure in the heatmap) in different locations.

Data from the first location is stored here.

I execute the following code:

import dcor
import pandas as pd

df_1 = pd.read_csv('df_1.csv')
pd.options.display.float_format = '{:,.3f}'.format
# pairwise distance correlation between every pair of columns
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt_1 = df_1.apply(lambda col1: df_1.apply(lambda col2: distcorr(col1, col2)))
rslt_1

and I get the expected (square, symmetric) distance correlation matrix [image: the matrix in tabular form], which I can plot with sns.heatmap as:

h=sns.heatmap(rslt_1, cmap="mako",  vmin=0, vmax=1, 
              xticklabels=True, yticklabels=True, square=True) 
fig = plt.gcf()
fig.set_size_inches(14, 10)

[image: heatmap of the distance correlation matrix for the first location]

However, when I try to pass the distance correlation matrix to `sns.clustermap` with:

s=sns.clustermap(rslt_1, cmap="mako", standard_scale=1, linewidths=0) 
fig = plt.gcf()
fig.set_size_inches(10, 10);

I get this:

[image: clustermap for the first location, with different ordering on rows and columns]

which is very weird to me, because I expected the same ordering on both rows and columns, as in the modified documentation example above. Unless I'm totally out to lunch and am missing or misunderstanding something important.

If I pass metric='correlation' like this:

s=sns.clustermap(rslt_1, cmap="mako", metric='correlation', 
             standard_scale=1, linewidths=0) 
fig = plt.gcf()
fig.set_size_inches(10, 10);

I get a result that is symmetric about the diagonal, as I expected, and if I 'eyeball' those clusters they make more sense to me when I compare them to the matrix in tabular form: [image: clustermap for the first location with metric='correlation']

With the data from the second location, which is stored here, I get reasonable (and fairly similar, although not identical) results whether I pass metric='correlation' or not: [image: clustermaps for the second location]

I cannot explain the behavior in the first case. Am I missing something?

Thank you.

PS I am on a Windows 10 PC. Some info:

[two images: system information]

MyCarta

1 Answer


Remove the standard_scale parameter from your clustermap call.

According to the seaborn documentation (seaborn.pydata.org/generated/seaborn.clustermap.html), standard_scale=1 standardizes the column dimension, which means that for each column it subtracts the minimum and divides by the maximum. The data matrix you pass to clustermap already appears to lie in [0, 1], so the rescaling is not needed.
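A likely reason why the rescaling changes the ordering, sketched under assumptions: a symmetric matrix has identical row and column vectors, so clustering rows and clustering columns see the same data. Scaling each column independently breaks that symmetry. The small matrix and the scale_columns helper below are hypothetical stand-ins (not your data, and not seaborn's own code); the helper only imitates the "subtract the minimum, divide by the maximum" behaviour described in the documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric matrix standing in for a distance correlation matrix
labels = ["A", "B", "C"]
sym = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=labels, columns=labels,
)

def scale_columns(df):
    """Imitate column-wise standard scaling: subtract each column's
    minimum, then divide by the maximum of the shifted column."""
    shifted = df - df.min()
    return shifted / shifted.max()

scaled = scale_columns(sym)

# The input is symmetric, the column-scaled version is not
is_symmetric_before = np.allclose(sym.values, sym.values.T)
is_symmetric_after = np.allclose(scaled.values, scaled.values.T)
print(is_symmetric_before, is_symmetric_after)
```

Once the matrix is no longer symmetric, the row vectors and the column vectors differ, so the row dendrogram and the column dendrogram are free to come out in different orders.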

s=sns.clustermap(rslt_1, cmap="mako", metric='correlation', linewidths=0)

Clustering is basically grouping data based on relationships among the variables in the data. Clustering algorithms help in getting structured data in unsupervised learning. The most common types of clustering are shown in the picture below.

[image: clustering types]

The clustermap() function of seaborn plots a hierarchically clustered heatmap of the given matrix dataset. It returns a ClusterGrid instance.

In agglomerative clustering, we start by treating each data point as its own cluster and then repeatedly merge the two nearest clusters into a larger one until a single cluster is left. The tree diagram plotted from the result of agglomerative clustering is called a "dendrogram".
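As a minimal sketch of that procedure, the snippet below runs agglomerative clustering with SciPy on the rows and on the columns of a small symmetric matrix and reads off the leaf order of each dendrogram. The matrix is a hypothetical example, and method='average' with metric='euclidean' is chosen here to match clustermap's documented defaults; this illustrates the idea and is not seaborn's internal code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical symmetric matrix (e.g. a small distance correlation matrix)
sym = np.array([
    [1.0, 0.8, 0.2, 0.3],
    [0.8, 1.0, 0.5, 0.4],
    [0.2, 0.5, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
])

# Cluster the rows, then the columns (the rows of the transpose)
row_linkage = linkage(sym, method="average", metric="euclidean")
col_linkage = linkage(sym.T, method="average", metric="euclidean")

# Leaf order of each dendrogram, i.e. the order used to re-arrange the heatmap
row_order = list(leaves_list(row_linkage))
col_order = list(leaves_list(col_linkage))
print(row_order, col_order)
```

For a symmetric input the two calls receive identical data, so the row and column orders agree. That stops being guaranteed as soon as the matrix passed to the clustering is no longer symmetric.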

Clustering obviously re-orders your data.

Ouatataz
  • Thank you. Can you expand your answer to elaborate on that? It does seem to resolve the issue, but why? Standardizing is supposed to normalize to the [0, 1] range as far as I can tell from the documentation, but is changing the order also expected behavior? – MyCarta Mar 27 '20 at 15:44
  • I am not an expert in data science, but according to the seaborn documentation (https://seaborn.pydata.org/generated/seaborn.clustermap.html), standard_scale=1 standardizes the column dimension, which means that for each column it subtracts the minimum and divides by the maximum. The data matrix you pass to clustermap already appears to lie in [0, 1]. – Ouatataz Mar 27 '20 at 16:23
  • That still does not convince me that re-ordering is intended behavior. However, since your suggestion does solve the problem in the question, if you edit the answer to incorporate a bit more detail (it may come in handy to others in the future) I will accept it. It's upvoted for now. Thank you – MyCarta Mar 30 '20 at 22:13
  • I have not looked at this in a while. I would like to accept your answer; would you expand it a bit? Others may benefit in the future. Thanks – MyCarta May 28 '21 at 20:08