I am working with a set of species counts (counts) from several different sample stations (stations). I have calculated the Bray-Curtis dissimilarity between every possible pair of sample stations using the pw_distances function from scikit-bio. This produces a distance matrix with values bounded between 0 and 1. So far so good.
I want to use that distance matrix to produce a dendrogram showing how the sample stations cluster together. I am doing this using scipy's hierarchy.linkage function to find the linkages for the dendrogram, and then plotting with hierarchy.dendrogram.
Here's my code:
from skbio.diversity.beta import pw_distances
from scipy.cluster import hierarchy
bc_dm = pw_distances(counts, stations, metric = "braycurtis")
# use (1 - bc_dm) to get similarity rather than dissimilarity
sim = 1 - bc_dm.data
Z = hierarchy.linkage(sim, 'ward')
hierarchy.dendrogram(
    Z,
    leaf_rotation=0.,  # rotates the x axis labels
    leaf_font_size=10.,  # font size for the x axis labels
    labels=bc_dm.ids,
    orientation="left"
)
Here is a link to the dendrogram produced by the above code.
As I understand it, the distance on the dendrogram should correspond to the Bray-Curtis similarity (analogous to a distance), but the distance values on my dendrogram reach a maximum of over 30. Is this correct? If not, how can I scale my distances to correspond to the Bray-Curtis similarity between sample stations? If it is correct, what do the distances on the dendrogram really correspond to?
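For comparison, here is a minimal self-contained sketch of how hierarchy.linkage interprets its input. It is not the code above: the count data and station ids are made up, scipy's pdist stands in for scikit-bio's pw_distances, and 'average' linkage is swapped in for 'ward' (scipy documents 'ward' as correctly defined only for Euclidean distances). As far as I can tell from the scipy documentation, a 1-D condensed array is read as pairwise distances, while a 2-D array is read as observation vectors, and Euclidean distances between its rows are computed first.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 6 stations x 8 species, values invented for illustration
rng = np.random.default_rng(0)
example_counts = rng.integers(1, 50, size=(6, 8))
example_ids = ["S1", "S2", "S3", "S4", "S5", "S6"]

# Condensed (1-D) Bray-Curtis dissimilarities; each value lies in [0, 1]
condensed = pdist(example_counts, metric="braycurtis")

# A 1-D condensed array is treated as the pairwise distances themselves,
# so merge heights stay on the Bray-Curtis scale
Z_dist = hierarchy.linkage(condensed, method="average")

# A 2-D square array is treated as one observation vector per row;
# linkage then clusters on Euclidean distances between those rows
# (scipy may emit a ClusterWarning here)
square = squareform(condensed)
Z_rows = hierarchy.linkage(square, method="average")

print("max merge height, condensed input:", Z_dist[:, 2].max())
print("max merge height, square input:   ", Z_rows[:, 2].max())

# Build the tree layout without drawing it (no matplotlib needed)
tree = hierarchy.dendrogram(Z_dist, labels=example_ids, no_plot=True)
```

With the condensed input and average linkage, every merge height is an average of Bray-Curtis dissimilarities, so the dendrogram axis stays between 0 and 1. With the square input, the heights are on a different (Euclidean) scale and no longer correspond to Bray-Curtis values.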