
I am working with a set of species counts (counts) from several different sample stations (stations). I have calculated the Bray-Curtis dissimilarity between every possible pair of sample stations using the pw_distances function from scikit-bio. This produces a distance matrix with values bounded between 0 and 1. So far so good.

I want to use that distance matrix to produce a dendrogram showing how the sample stations cluster together. I am doing this using scipy's hierarchy.linkage function to find the linkages for the dendrogram, and then plotting with hierarchy.dendrogram.

Here's my code:

from skbio.diversity.beta import pw_distances
from scipy.cluster import hierarchy

bc_dm = pw_distances(counts, stations, metric = "braycurtis")

# use (1 - bc_dm) to get similarity rather than dissimilarity
sim = 1 - bc_dm.data

Z = hierarchy.linkage(sim, 'ward')
hierarchy.dendrogram(
    Z,
    leaf_rotation=0.,  # rotates the x axis labels
    leaf_font_size=10.,  # font size for the x axis labels
    labels=bc_dm.ids,
    orientation="left"
)

Here is a link to the dendrogram produced by the above code.

As I understand it, the distance on the dendrogram should correspond to the Bray-Curtis similarity (analogous to a distance), but the distance values on my dendrogram reach a maximum of over 30. Is this correct? If not, how can I scale my distances to correspond to the Bray-Curtis similarity between sample stations? If it is correct, what do the distances on the dendrogram really correspond to?

  • See https://stackoverflow.com/questions/40700628/scipy-cluster-hierarchy-labels-seems-not-in-the-right-order-and-confused-by-th/40707534#40707534, or https://stackoverflow.com/questions/48331537/label-ordering-in-scipy-dendrogram/48331999#48331999, or https://stackoverflow.com/questions/41416498/dendrogram-or-other-plot-from-distance-matrix/41418241#41418241, or ... (it is a common issue, reflecting a confusing API). – Warren Weckesser Feb 05 '18 at 23:16

1 Answer

See the links shared in the comments as they address your questions.

One scikit-bio step that isn't covered in those links is that you should call linkage on bc_dm.condensed_form(), rather than on bc_dm or sim. This gives linkage its input in the condensed format it expects. If you pass a 2D matrix, linkage assumes it is your counts matrix (one row per observation) and computes Euclidean distances between your samples from those data.
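As a rough illustration, here is a minimal sketch that uses only SciPy, so it runs without scikit-bio. It rests on some assumptions: pdist with metric="braycurtis" stands in for pw_distances, and the counts and station names are made-up data. With scikit-bio, bc_dm.condensed_form() should supply the same kind of condensed vector that squareform produces here.

```python
import warnings

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform

# Made-up data for illustration: 6 stations (rows) x 5 species (columns).
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(6, 5))
stations = [f"S{i}" for i in range(6)]

# Condensed (1D) Bray-Curtis dissimilarities: one value per pair of stations.
condensed = pdist(counts, metric="braycurtis")

# The square (2D) form is what a distance-matrix object typically stores;
# squareform converts between the two layouts.
square = squareform(condensed)
assert np.allclose(squareform(square), condensed)

# Intended usage: linkage receives the condensed vector and treats it as distances.
Z = hierarchy.linkage(condensed, method="average")

# The mistake from the question: a 2D input is treated as raw observations,
# so Euclidean distances between the ROWS of the matrix get clustered instead.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # SciPy may warn that this looks like a distance matrix
    Z_wrong = hierarchy.linkage(square, method="average")

print("max merge height, condensed input:", Z[:, 2].max())
print("max merge height, square input:   ", Z_wrong[:, 2].max())

# no_plot=True returns the layout without requiring matplotlib.
tree = hierarchy.dendrogram(Z, labels=stations, orientation="left", no_plot=True)
print("leaf order:", tree["ivl"])
```

With average linkage on the condensed input, every merge height is an average of Bray-Curtis values, so the axis stays within the 0 to 1 range of the original matrix.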

Also, be sure to pay attention to the method parameter to scipy.cluster.hierarchy.linkage, as that will impact the interpretation of the branch lengths in your dendrogram. The docstring for scipy.cluster.hierarchy.linkage contains details on how these are computed for the different methods.
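To see how much the method matters, the sketch below runs several methods on one condensed Bray-Curtis vector and prints the largest merge height for each. The data are made up, and pdist again stands in for scikit-bio's pw_distances. Note that the SciPy documentation describes 'ward' (along with 'centroid' and 'median') as correctly defined only for Euclidean distances, so its heights on Bray-Curtis input should not be read as Bray-Curtis values.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Made-up data for illustration: 8 stations (rows) x 6 species (columns).
rng = np.random.default_rng(1)
counts = rng.integers(1, 100, size=(8, 6))

# Condensed Bray-Curtis dissimilarities, each between 0 and 1.
condensed = pdist(counts, metric="braycurtis")

# Largest merge height (third column of the linkage matrix) per method.
heights = {}
for method in ("single", "complete", "average", "ward"):
    Z = hierarchy.linkage(condensed, method=method)
    heights[method] = Z[:, 2].max()

print("largest pairwise dissimilarity:", condensed.max())
for method, height in heights.items():
    print(f"{method:>8}: max merge height = {height:.3f}")
```

For 'single', 'complete', and 'average', each merge height is a minimum, maximum, or mean of the input dissimilarities, so it cannot exceed the largest pairwise value. The 'ward' heights come from a variance-based update formula and are not bounded in that way.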

gregcaporaso