I am doing hierarchical clustering of a 2-dimensional matrix using the correlation distance metric (i.e. 1 - Pearson correlation). My code is the following (the data is in a variable called "data"):

from hcluster import *

Y = pdist(data, 'correlation')
cluster_type = 'average'
Z = linkage(Y, cluster_type)
dendrogram(Z)

The error I get is:

ValueError: Linkage 'Z' contains negative distances. 

What causes this error? The matrix "data" that I use is simply:

[[  156.651968  2345.168618]
 [  158.089968  2032.840106]
 [  207.996413  2786.779081]
 [  151.885804  2286.70533 ]
 [  154.33665   1967.74431 ]
 [  150.060182  1931.991169]
 [  133.800787  1978.539644]
 [  112.743217  1478.903191]
 [  125.388905  1422.3247  ]]

I don't see how pdist could ever produce negative numbers when computing 1 - Pearson correlation. Any ideas on this?

thank you.

2 Answers

This is a floating-point rounding problem. If you look at the results of pdist, you'll find very small negative numbers (-2.22044605e-16) in them. Essentially, they should be zero. You can use NumPy's clip function to force them to zero if you would like.
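
As a rough illustration of the clipping idea, here is a minimal sketch using the question's data. It assumes SciPy's `scipy.spatial.distance.pdist` and `scipy.cluster.hierarchy` in place of the older `hcluster` package, and it passes `no_plot=True` only so that the example runs without matplotlib.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

data = np.array([
    [156.651968, 2345.168618],
    [158.089968, 2032.840106],
    [207.996413, 2786.779081],
    [151.885804, 2286.70533],
    [154.33665, 1967.74431],
    [150.060182, 1931.991169],
    [133.800787, 1978.539644],
    [112.743217, 1478.903191],
    [125.388905, 1422.3247],
])

# Condensed vector of pairwise correlation distances (1 - Pearson correlation).
Y = pdist(data, 'correlation')

# Rounding can leave values such as -2.2e-16; force them up to zero.
# No upper bound is applied: correlation distance can be as large as 2.
Y = np.clip(Y, 0.0, None)

Z = linkage(Y, 'average')
result = dendrogram(Z, no_plot=True)  # no_plot=True: compute the layout only

print(Y.min(), Y.max())
```

One thing this makes visible: each row here has only two values, and the Pearson correlation between two 2-element vectors is always +1 or -1. Every distance for this matrix is therefore zero up to rounding error, so the tree carries little information whichever way the rounding is handled.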

Justin Peel
  • I tried `Y = clip(Y, 0, 1)` after computing Y from pdist using 'correlation', but it did not work: the clusterings I get for the matrix I showed above are very weird. Any idea what might be happening? This only happens with 'correlation' as the argument to pdist. –  May 31 '10 at 03:42
  • You could try using something like `Y[abs(Y)<3e-16] = 0.0` instead, because you also have some very small positive distances. Numbers like that can sometimes really throw things off. Quite frankly, I don't have much experience with the clustering module. Could it have to do with using 'average' for the cluster type? – Justin Peel May 31 '10 at 04:20
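
The thresholding suggested in the comment above can be sketched on a small hand-made array. The values below are illustrative only, not actual pdist output, and the `3e-16` cutoff is the one proposed in the comment.

```python
import numpy as np

# Illustrative distances: two rounding-error values, two genuine distances.
Y = np.array([-2.22044605e-16, 1.1e-16, 0.5, 2.0])

# Zero out anything within 3e-16 of zero, whether negative or positive.
Y[np.abs(Y) < 3e-16] = 0.0

print(Y)
```

Unlike clipping at zero, this also removes the tiny positive values, so distances that should be tied at zero stay tied.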

If you are getting the error

KeyError: -428

and your code is along the lines of

import matplotlib.pyplot as plt
import matplotlib as mpl

%matplotlib inline 
from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering pre-computed distances
fig, ax = plt.subplots(figsize=(35, 20),dpi=400) # set size
ax = dendrogram(linkage_matrix, orientation="right",labels=queries);

This is due to a mismatch in the indexes of queries.

You might want to update the call to

ax = dendrogram(linkage_matrix, orientation="right",labels=list(queries));
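
To make the index mismatch concrete, here is a small sketch. It assumes `queries` is a pandas Series whose index does not start at 0, which the answer does not state explicitly. The points and labels are made up for illustration, and `no_plot=True` is used only so the example runs without matplotlib.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import ward, dendrogram

# Hypothetical data: four points forming two obvious pairs.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
dist = pdist(points)  # condensed Euclidean distances

# Hypothetical labels: a Series whose index is 10..13 rather than 0..3,
# as happens after filtering or slicing a DataFrame.
frame = pd.DataFrame({"query": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])
queries = frame["query"]

linkage_matrix = ward(dist)

# list(queries) discards the Series index, so labels are looked up by position.
result = dendrogram(linkage_matrix, orientation="right",
                    labels=list(queries), no_plot=True)

print(result["ivl"])  # leaf labels in dendrogram order
```

With a Series like this, `queries[0]` is a label-based lookup and fails because there is no index entry 0. Converting to a list sidesteps that.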
Ronak Agrawal