0

So, I have a set of texts I'd like to do some clustering analysis on. I've taken a Normalized Compression Distance between every text, and now I have basically built a complete graph with weighted edges that looks something like this:

text1, text2, 0.539
text2, text3, 0.675

I'm having tremendous difficulty figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?

Jason Baker
  • 192,085
  • 135
  • 376
  • 510

1 Answers1

1

You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single link (i.e. the distance between two clusters is the distance between their closest points). All of these methods will return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.

However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this issue by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which will convert it into the format which is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.

Community
  • 1
  • 1
seaotternerd
  • 6,298
  • 2
  • 47
  • 58