I am new to plotly and need to draw a dendrogram with group average linkage.

I am aware that there is a distfun parameter in create_dendrogram(), but I have no idea what to pass to it to get group average linkage. The distfun argument apparently has to be callable. What function should I pass to it?

As a side note, I have a sample pairwise distance matrix

0
13 0
2 14 0
17 1 18 0

which, when passed to create_dendrogram(), seems to produce an incorrect result. What am I doing wrong here?

code:

import plotly.figure_factory as ff

import numpy as np

X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])

names = list("0123")
fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig.update_layout(width=800, height=800)
fig.show()

The code is copied almost literally from the plotly website, because I don't know what else to do: https://plotly.com/python/v3/dendrogram/

sentence

2 Answers


You can choose a linkage method by passing a function built on scipy.cluster.hierarchy.linkage() to the linkagefun argument of create_dendrogram().

For example, to use the UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm, which scipy calls "average" linkage:

import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np

X = np.array([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])

names = list("0123")
fig = ff.create_dendrogram(X,
                           orientation='left',
                           labels=names,
                           # group average (UPGMA) linkage on the condensed distances
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.update_layout(width=800, height=800)
fig.show()

Note that X has to be a matrix of data samples (rows are samples, columns are features), not a distance matrix.
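
To see why the kind of input matters, here is a minimal sketch using scipy alone (no plotting). It assumes the matrix from the question is the lower triangle of a symmetric pairwise distance matrix; the variable names are illustrative only. When the matrix is read as data samples, linkage computes Euclidean distances between its rows; when it is mirrored and condensed first, the original distances are used.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Lower triangle of the pairwise distance matrix from the question.
lower = np.array([[0, 0, 0, 0],
                  [13, 0, 0, 0],
                  [2, 14, 0, 0],
                  [17, 1, 18, 0]], dtype=float)

# Read as data samples: linkage computes Euclidean distances between rows.
Z_samples = sch.linkage(lower, "average")

# Read as distances: mirror the triangle, then convert to condensed form.
condensed = squareform(lower + lower.T)
Z_distances = sch.linkage(condensed, "average")

print(Z_samples[0, :2])    # first pair merged when rows are read as samples
print(Z_distances[0, :2])  # first pair merged when entries are read as distances
```

With this input the first call merges items 0 and 1 (the first two rows are 13 apart in Euclidean terms), while the second merges items 1 and 3 at distance 1, then 0 and 2 at distance 2, and finally joins the two pairs at (13 + 17 + 14 + 18) / 4 = 15.5.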

sentence
  • There is no error indeed. It depends on the method you specify. To give you an example, I used `average`. You can see other linkage methods [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) and the code [here](https://github.com/scipy/scipy/blob/v1.4.1/scipy/cluster/hierarchy.py#L833-L1077). – sentence Apr 05 '20 at 11:45
  • But it is supposed to group lower distance values first. "average" is the method I need, but the grouping is wrong. Since (1,3) has distance 1, it should group 1 and 3 together, but when executed it grouped (0,1) together. – user13226847 Apr 05 '20 at 11:49
  • Ok, I see. `X` has to be a matrix of data samples. NOT a distance matrix. – sentence Apr 05 '20 at 11:57
  • Then how do I convert an NxN pairwise distance matrix to the "1d condensed distance matrix" that linkage needs? scipy.spatial.distance.pdist doesn't seem to do the trick (it requires a 2x2 matrix). – user13226847 Apr 05 '20 at 12:04
  • I used `scipy.spatial.distance.squareform` to (supposedly) convert the pairwise matrix to a condensed matrix, but when I ran the code I got `in get_dendrogram_traces d=distfun(X) in pdist raise ValueError: A 2-dimensional array must be passed.` How do I fix this? – user13226847 Apr 05 '20 at 12:16
  • You have to simply use your dataset. No conversion. Just use for `X` a matrix of data samples. Rows are samples, columns are features. – sentence Apr 05 '20 at 13:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210995/discussion-between-sentence-and-user13226847). – sentence Apr 05 '20 at 13:18
  • But is there a way to get the distance matrix that the plotly dendrogram is based on? – Newbielp Dec 09 '21 at 11:17
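
On that last question: as far as I can tell, when no distfun is given, create_dendrogram falls back to scipy.spatial.distance.pdist with its default Euclidean metric, so the distances can be recomputed outside plotly. A minimal sketch under that assumption, using made-up sample data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data samples: rows are samples, columns are features.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

condensed = pdist(X)            # Euclidean by default; one entry per pair (i < j)
square = squareform(condensed)  # full symmetric matrix with a zero diagonal

print(condensed)
print(square)
```

If a custom distfun was passed, calling that same function on X gives the condensed distances instead.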

This is a bit old but, for anyone else with similar issues: I think the distfun parameter simply specifies how you want to convert your data matrix to a condensed distance matrix, and you define that function yourself.

For example, after a bit of head banging I cobbled together data_to_dist_matrix to convert a data matrix to a Jaccard distance matrix and then condense it. Be aware that plotly's dendrogram implementation does not check whether your matrix is condensed, so your distfun needs to ensure this. Maybe this is wrong, but it looks like distfun should take only one positional parameter (the data matrix) and return one object (the condensed distance matrix):

import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import jaccard, squareform

def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):
    """Jaccard distance between two feature lists, treated as binary presence/absence."""
    # filler_val can be used to even up ragged lists and to ignore certain
    # values (e.g. proteins not in a module). Works for numpy arrays and lists.
    all_features = set(i for i in feature_list1 if i != filler_val)
    all_features.update(i for i in feature_list2 if i != filler_val)
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

def data_to_dist_matrix(mn_data, filler_val=0):
    # Notes:
    #   - the original plotly example uses pdist (Euclidean distance by default) for clustering.
    #   - pdist 'Returns a condensed distance matrix Y' - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist.
    #   - a condensed distance matrix is required as input to scipy linkage for clustering.
    #   - the plotly dendrogram function does not apply this conversion to the output of a given distfun call - https://github.com/plotly/plotly.py/blob/cfad7862594b35965c0e000813bd7805e8494a5b/packages/python/plotly/plotly/figure_factory/_dendrogram.py#L340
    #   - therefore you should convert the distance matrix to condensed form yourself, as below with squareform.
    distance_matrix = np.array([[jaccard_dissimilarity(a, b, filler_val) for b in mn_data] for a in mn_data])
    return squareform(distance_matrix)



# toy data to visually check clustering looks sensible
data_array = np.array([[1, 2, 3, 0],
                       [2, 3, 10, 0],
                       [4, 5, 6, 0],
                       [5, 6, 7, 0],
                       [7, 8, 1, 0],
                       [1, 2, 8, 7],
                       [1, 2, 3, 8],
                       [1, 2, 3, 4]])

y_labels = [f'MODULE_{i}' for i in range(8)]

#this is the distance matrix and condensed distance matrix made by data_to_dist_matrix and is only included so I can check what it's doing
dist_matrix = np.array([[jaccard_dissimilarity(a,b, 0) for b in data_array] for a in data_array])
condensed_dist_matrix = data_to_dist_matrix(data_array, 0)

# Create side dendrogram
fig = ff.create_dendrogram(data_array,
                           orientation='right',
                           labels=y_labels,
                           distfun=data_to_dist_matrix)
fig.show()
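
For anyone unsure what "condensed" means here: a condensed distance matrix is the upper triangle of the square matrix (diagonal excluded) flattened row by row, so n items give n(n-1)/2 values. The sketch below is a plain-Python illustration of that layout, not plotly or scipy code; to_condensed and condensed_index are hypothetical helper names, and the example matrix is the symmetric form of the one in the question.

```python
def to_condensed(square):
    """Flatten the upper triangle (diagonal excluded) of a square matrix, row by row."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


def condensed_index(n, i, j):
    """Position of the pair (i, j), with i < j, in a condensed vector for n items."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# Small symmetric example with a zero diagonal.
square = [[0, 13, 2, 17],
          [13, 0, 14, 1],
          [2, 14, 0, 18],
          [17, 1, 18, 0]]

condensed = to_condensed(square)
print(condensed)                            # 6 values for 4 items
print(condensed[condensed_index(4, 1, 3)])  # distance between items 1 and 3
```

This is the same ordering of pairs that pdist produces and that squareform converts to and from.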
Tim Kirkwood
  • Thanks, but I'm a bit confused regarding condensed/non-condensed distance matrix. If I have a distance matrix to begin with (and not the data themselves), can I just pass `distfun=None`? Will this create the dendrograms based on the distances in the original matrix? – soungalo Jun 06 '22 at 08:07
  • Perhaps define a distfun that takes the distance matrix X as input and returns the condensed distance matrix sX (or a distfun that simply returns its input if what you have is already condensed). I'm hazy about what the condensed distance matrix is, but these links might help - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html and https://stackoverflow.com/questions/13079563/how-does-condensed-distance-matrix-work-pdist. IIRC it's removing the duplicate values found in every distance matrix (as they are symmetrical across the diagonal), but I could be wrong. – Tim Kirkwood Jun 07 '22 at 09:23
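
A rough sketch of that idea, assuming the starting point is a square, symmetric distance matrix with a zero diagonal: pass the square matrix as X, let distfun condense it with squareform, and pick group average linkage via linkagefun. dendrogram_from_distances is a hypothetical helper name; plotly is imported inside it so the conversion itself can be tried without plotly installed.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform


def dendrogram_from_distances(square, labels):
    """Build a dendrogram figure from a square distance matrix (hypothetical helper)."""
    import plotly.figure_factory as ff  # imported lazily; only needed for plotting

    return ff.create_dendrogram(
        np.asarray(square, dtype=float),
        orientation="left",
        labels=labels,
        distfun=squareform,  # square matrix in, condensed vector out
        linkagefun=lambda d: sch.linkage(d, "average"),
    )


# Symmetric version of the distance matrix from the question.
D = np.array([[0, 13, 2, 17],
              [13, 0, 14, 1],
              [2, 14, 0, 18],
              [17, 1, 18, 0]], dtype=float)

print(squareform(D))  # the condensed vector that linkage will receive
```

Calling dendrogram_from_distances(D, list("0123")).show() should then draw the tree from the given distances. As far as I know, leaving distfun=None would instead run pdist on the rows of the matrix, treating them as data samples rather than distances.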