I am trying to hierarchically cluster a 2D numpy array so that it looks good when I graph it as a correlation matrix in d3.js.

My data looks like this:

[[ 1.   0.091  0.147 ..., -0.239  0.113  -0.012 ]
 [ 0.091  1.  -0.153 ..., -0.004 -0.244  -0.00520801]
 [ 0.147 -0.153  1.  ..., -0.157  0.013   0.133]
 ..., 
 [-0.239  -0.004 -0.157   ...,  -0.265  -0.362  1. ]]

I calculated these as Pearson correlation coefficients, so they lie between -1 and 1. As you can see, the diagonal from the top left of the array to the bottom right is all 1s, since each variable is perfectly correlated with itself.

If I graph these values, my correlation matrix looks like this:

correlation matrix before clustering

After clustering, I want it to look somewhat like this, where red represents positive correlations and blue represents negative correlations:

heat

Using matplotlib and scipy, I can cluster the coefficients so that they look like a heatmap; however, the values are changed. I want my values to remain the same.

I used this answer to graph the heatmap in Python, but it's not quite what I want since it changes my values. All I need is to cluster the data and output it to a csv/json file.

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# If I use this, it gives me an array that is about half the size of my original
# correlation matrix. These are the distances. How do I use this to re-order my
# correlation matrix as a clustered matrix?
data_dist = pdist(final_correlation, 'correlation')

Out[1]: # The size is 9730, as opposed to the original size of 19,600
[ 0.612  0.503  1.653 ...,  0.792  1.577  0.829]
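
For reference, the size difference can be reproduced with a small sketch. It assumes the matrix is 140 x 140 (which gives the 19,600 entries mentioned above) and uses random numbers as a hypothetical stand-in for final_correlation. pdist returns a condensed vector with one distance per unordered pair of rows, n * (n - 1) / 2 values, and squareform expands it back into a square matrix. Note that these are newly computed distances between rows, not the original coefficients.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

n = 140  # assumption: 140 x 140 = 19,600 entries, as in the question
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(n, n))  # hypothetical stand-in for final_correlation

# Condensed distance vector: one entry per unordered pair of rows.
condensed = pdist(stand_in, "correlation")
print(condensed.shape)

# squareform expands the condensed vector into a symmetric n x n matrix
# with zeros on the diagonal.
square = squareform(condensed)
print(square.shape)
```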

UPDATE: If anyone knows R, the code I am trying to execute would probably look something like this

achabacha322

1 Answer

Sorry for not giving a complete example, but I found a way to cluster the data, though not as nicely as I wanted:

Suppose you have a csv file with correlations and a header row. Copy the contents of the csv file to the clipboard and run this code:

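import scipy.cluster.hierarchy as hc
import pandas
from matplotlib import pyplot

# Copy the data to the clipboard first.
d = pandas.read_clipboard(sep=",", index_col=0)
# This assumes the header labels are integers; convert them from strings.
d.columns = [int(x) for x in d.columns]

# Cluster the rows and get the order of the leaves in the dendrogram.
link = hc.linkage(d.values, method='centroid')
o1 = hc.leaves_list(link)

# Re-order the rows, then the columns (in reverse leaf order).
# Only the order changes; the values stay the same.
mat = d.iloc[o1, :]
mat = mat.iloc[:, o1[::-1]]
pyplot.imshow(mat)
pyplot.show()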

This will result in something like this: Imgur

The correlation values in the csv contain duplicate values (each pair appears twice), so you have to reverse the second part of the re-ordering, i.e. the column order (o1[::-1]).
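
A possible variant, in case it helps: a minimal sketch that clusters on a distance derived from the correlations themselves and applies the same leaf order to both rows and columns, so the diagonal of 1s stays on the diagonal and every value is preserved. The choice of 1 - r as the dissimilarity and of average linkage are assumptions, not the only options. The helper name cluster_order and the generated data are hypothetical stand-ins for the real matrix.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform


def cluster_order(corr):
    """Return a leaf order from hierarchical clustering of a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    # Assumption: use 1 - r as the dissimilarity, so strongly
    # positively correlated variables end up close together.
    dist = np.clip(1.0 - corr, 0.0, 2.0)
    dist = (dist + dist.T) / 2.0  # remove tiny asymmetries from rounding
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # square -> condensed vector
    link = hc.linkage(condensed, method="average")
    return hc.leaves_list(link)


# Hypothetical stand-in data: two groups of correlated variables, interleaved.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
samples = np.column_stack(
    [base[:, i % 2] + 0.3 * rng.normal(size=200) for i in range(6)]
)
corr = pd.DataFrame(np.corrcoef(samples, rowvar=False))

order = cluster_order(corr.values)
# Apply the same order to rows and columns; only positions change.
clustered = corr.iloc[order, order]

# Serialize for d3.js; pass a file path to to_csv / to_json to write to disk.
csv_text = clustered.to_csv()
json_text = clustered.to_json(orient="split")
```

Because the row and column labels travel with the values, the output still records which variable each row and column refers to after re-ordering.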

achabacha322