You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. Each of the related functions fixes how the distance between clusters is calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single linkage (i.e. the distance between two clusters is the distance between their closest points). All of these functions return a two-dimensional array (the linkage matrix) representing the agglomerative clustering, with one row per merge. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.
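As a rough illustration of how these functions relate, here is a minimal sketch. The input is a flat vector of pairwise distances (the condensed form discussed below) for four made-up points at positions 0, 1, 5 and 6 on a line. The data and variable names are my own assumptions, not something from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, single, complete, fcluster

# Hypothetical pairwise distances for 4 points at 0, 1, 5, 6 on a line,
# in the order d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3).
y = np.array([1.0, 5.0, 6.0, 4.0, 5.0, 1.0])

Z_default = linkage(y)    # no method given, so single linkage is used
Z_single = single(y)      # explicitly single linkage
Z_complete = complete(y)  # farthest-point (complete) linkage

# Each row of a linkage matrix describes one merge:
# [cluster index a, cluster index b, merge distance, new cluster size]
print(Z_default.shape)

# Flatten the hierarchy into at most two flat clusters.
labels = fcluster(Z_single, t=2, criterion="maxclust")
print(labels)
```

The last merge happens at distance 4 under single linkage and at distance 6 under complete linkage, which shows the effect of the choice of method on the same data.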
However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Since the GitHub issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which converts it into the format that is actually accepted: a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix. You can then pass the result to one of the scipy.cluster.hierarchy functions.
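The workaround might look something like the sketch below. The 4x4 matrix D is made-up example data standing in for your own table, and complete linkage is picked only for illustration.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical redundant (square, symmetric, zero-diagonal) distance matrix.
D = np.array([
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
])

# Convert to the condensed form: the upper triangle, row by row,
# as a flat array of length n * (n - 1) / 2.
condensed = squareform(D)
print(condensed)

# Pass the condensed matrix to the clustering function.
Z = linkage(condensed, method="complete")

# Compute the dendrogram layout without drawing it (no matplotlib needed).
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])
```

squareform also works in the other direction, so calling it on the condensed array gives back the square matrix if you need it later.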