
When performing hierarchical clustering with scipy, the docs here say that scipy.cluster.hierarchy.linkage takes a 1-D condensed distance matrix or a 2-D array of observation vectors as input. However, I generated a simple (symmetric) similarity matrix with a pandas DataFrame, and scipy took that as input with no problem at all; the resulting dendrogram looks just fine.

Can someone explain how this is possible? Do I have outdated docs or...?

PasHyde
    It treated your square `n`-by-`n` matrix as `n` points each with length `n`. You won't get an error, but the output is not meaningful. See ["scipy.cluster.hierarchy: labels seems not in the right order, and confused by the value of the vertical axes"](https://stackoverflow.com/questions/40700628/scipy-cluster-hierarchy-labels-seems-not-in-the-right-order-and-confused-by-th) and ["How does condensed distance matrix work? (pdist)"](https://stackoverflow.com/questions/13079563/how-does-condensed-distance-matrix-work-pdist). – Warren Weckesser Apr 06 '21 at 16:10

1 Answer


The docs are accurate; they just don't tell you what will happen if you actually try to use an uncondensed (square) distance matrix.

The function raises a warning but still runs, because it first converts the input into a numpy array. This produces a 2-D array from your 2-D DataFrame, and at the same time scipy recognizes, from the array's dimensions and symmetry, that this probably isn't the input it expects.
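As a minimal sketch of that behavior (assuming a reasonably recent scipy; the exact warning text may vary), passing a symmetric, non-negative square matrix with a zero diagonal emits a ClusterWarning but still returns a linkage matrix:

import warnings
import numpy as np
from scipy.cluster.hierarchy import ClusterWarning, linkage

# A symmetric, non-negative matrix with a zero diagonal -- the shape
# scipy flags as looking "suspiciously like an uncondensed distance matrix".
mat = np.array([[0.0, 1.0, 2.0],
                [1.0, 0.0, 3.0],
                [2.0, 3.0, 0.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Z = linkage(mat, 'ward')  # runs anyway, treating each row as an observation

print(caught[0].category is ClusterWarning)  # True
print(Z.shape)  # (2, 4): a structurally valid linkage matrix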

Depending on the complexity of your input data (e.g. cluster separation, number of clusters, distribution of data across clusters), the clustering may still appear to succeed in generating a suitable dendrogram, as you noted. This makes sense conceptually because the result is a clustering of n similarity vectors, which may themselves be well separated in simple cases.

For example, here is some synthetic data with 150 observations and 2 clusters:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cosine, pdist, squareform

np.random.seed(42)  # for repeatability
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
obs_df = pd.DataFrame(np.concatenate((a, b),), columns=['x', 'y'])
obs_df.plot.scatter(x='x', y='y')

Z = linkage(obs_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)

[figures: scatter plot of the two clusters; two-cluster dendrogram]

If you generate a similarity matrix instead, you get an n x n matrix that can still be clustered as if it were n vectors of length n. I can't plot 150-dimensional vectors, but plotting the magnitude of each vector and then the dendrogram seems to confirm a similar clustering.

def similarity_func(u, v):
    # scipy's `cosine` is a distance, so 1 - cosine(u, v) is the cosine similarity
    return 1 - cosine(u, v)

# pdist with a custom callable returns the condensed (1-D) form;
# squareform expands it into the square n x n similarity matrix
dists = pdist(obs_df, similarity_func)
sim_df = pd.DataFrame(squareform(dists), columns=obs_df.index, index=obs_df.index)

# plot the L1 magnitude of each row, i.e. each point's 150-D similarity vector
sim_array = np.asarray(sim_df)
sim_lst = [np.linalg.norm(vec, ord=1) for vec in sim_array]
pd.Series(sim_lst).plot.bar()

Z = linkage(sim_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)

[figures: bar plot of similarity vector magnitudes; similarity vector dendrogram]

What we're really clustering here is, for each point, a vector whose components are that point's similarities to all 150 points: a collection of its intra- and inter-cluster similarity measures. Since the two clusters are different sizes, a point in one cluster has a rather different collection of intra- and inter-cluster similarities than a point in the other. Hence we get two primary clusters, proportionate to the number of points in each cluster, just as we did in the first step.
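For completeness, the way to avoid this ambiguity is to hand linkage the condensed form directly. A short sketch using the data above (D is just an illustrative name for a square distance matrix you might already have):

# the condensed 1-D output of pdist can go straight to linkage
dists = pdist(obs_df)  # Euclidean by default
Z = linkage(dists, 'ward')

# if you only have the square matrix, condense it first;
# squareform checks that it is symmetric with a zero diagonal
D = squareform(dists)            # square n x n matrix
Z = linkage(squareform(D), 'ward')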

Frodnar
  • Ok, brilliant, this clarifies the issue! So, there is no escaping, the matrix must be condensed. Thank you! – PasHyde Apr 07 '21 at 10:37