I have a distance matrix n*n M
where M_ij
is the distance between object_i
and object_j
. So as expected, it takes the following form:
/ 0 M_01 M_02 ... M_0n\
| M_10 0 M_12 ... M_1n |
| M_20 M_21 0 ... M2_n |
| ... |
\ M_n0 M_n2 M_n2 ... 0 /
Now I wish to cluster these n objects with hierarchical clustering. Python has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')
.
Its documentation says:
y must be a {n \choose 2} sized vector where n is the number of original observations paired in the distance matrix.
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
I am confused by this description of y
. Can I directly feed my M
in as the input y
?
Update
@hongbo-zhu-cn has raised this issue up in GitHub. This is exactly what I am concerning about. However, as a newbie to GitHub, I don't know how it works and therefore have no idea how this issue is dealt with.