I am trying to convert a dictionary into a distance matrix that I can then use as input to hierarchical clustering. My input is a dictionary with:

  • key: tuple of length 2 with the objects for which I have the distance
  • value: the actual distance value

    for k, v in obj_distances.items():
        print(k, v)
    

and the result is:

('obj1', 'obj2') 2.0 
('obj3', 'obj4') 1.58
('obj1', 'obj3') 1.95
('obj2', 'obj3') 1.80

My question is: how can I convert this into a distance matrix that I can later use for clustering in scipy?

desertnaut
Daniel Vieira
  • 1
    You can first create a matrix of zeros and then use `int(a[-1])` as the indices of your matrices where `a` is `'obj1'`, `'obj2'` etc. and store the distance values in that array – Sheldore Aug 03 '18 at 14:12
  • 1
    Is your set of distances complete? That is, does the distance dictionary contain a distance for each possible pair? For example, the data that you show doesn't include distances for `('obj1', 'obj4')` or `('obj2', 'obj4')`. You'll need these values to do clustering. – Warren Weckesser Aug 03 '18 at 16:18
  • Hi @WarrenWeckesser, yes it is, I just omitted it to save space, but yes I have all pairwise distances, thanks – Daniel Vieira Aug 03 '18 at 17:10

3 Answers

Use pandas and unstack the dataframe:

import pandas as pd

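data = {
    ('obj1', 'obj2'): 2.0,
    ('obj3', 'obj4'): 1.58,
    ('obj1', 'obj3'): 1.95,
    ('obj2', 'obj3'): 1.80,
}

# One row per pair, then split the tuple keys into a two-level index.
df = pd.DataFrame.from_dict(data, orient='index')
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
# Move the second index level into the columns to get a 2-D array.
dist_matrix = df.unstack().values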

yields

In [15]: dist_matrix
Out[15]:

array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
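
The array above only holds the pairs that appear as keys, so it is neither square nor symmetric. As a rough sketch of one way to get a full symmetric matrix with pandas, assuming every pair is present exactly once (the six example values below are the ones used in another answer on this page, and the names `upper`, `full` and `square` are just illustrative):

```python
import numpy as np
import pandas as pd

# Complete pairwise distances for four objects (assumed example data).
data = {
    ('obj1', 'obj2'): 2.0,
    ('obj1', 'obj3'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj2', 'obj3'): 1.8,
    ('obj2', 'obj4'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# A Series built from tuple keys gets a two-level MultiIndex.
s = pd.Series(data)
labels = sorted({name for pair in data for name in pair})

# Unstack to a 2-D frame, then force the same labels on both axes.
upper = s.unstack().reindex(index=labels, columns=labels)

# Fill the missing half from the transpose.
full = upper.combine_first(upper.T)

# Convert to a float array and set the self-distances to zero.
square = full.to_numpy(dtype=float, copy=True)
np.fill_diagonal(square, 0.0)
```

Here `square[i, j]` should be the distance between `labels[i]` and `labels[j]`. If some pair is missing from the dictionary, the corresponding entries stay `nan`.
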
Sam

You say you will use scipy for clustering, so I assume that means you will use the function `scipy.cluster.hierarchy.linkage`. `linkage` accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., "How does condensed distance matrix work? (pdist)" for a discussion of the condensed form.)

So all you have to do is get `obj_distances.values()` into a known order and pass that to `linkage`. That's what is done in the following snippet:

from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b.  If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)

The dendrogram:

[Figure: dendrogram]
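
To check that a list of distances really is in condensed order, one option is to expand it with `scipy.spatial.distance.squareform` and look at individual entries. A small sketch, assuming the four labels and six distances from the snippet above (the variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import squareform

labels = ['obj1', 'obj2', 'obj3', 'obj4']

# Condensed order for four objects:
# (obj1,obj2), (obj1,obj3), (obj1,obj4), (obj2,obj3), (obj2,obj4), (obj3,obj4)
condensed = np.array([2.0, 1.95, 2.5, 1.8, 2.1, 1.58])

# Expand to the full symmetric matrix with a zero diagonal.
square = squareform(condensed)

# Passing a square matrix converts it back to condensed form.
back = squareform(square)
```

With this ordering, `square[1, 3]` is the distance between `obj2` and `obj4`, and `back` matches `condensed` again.
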

Warren Weckesser

This will be slower than the other answer posted, but it ensures that values both above and below the main diagonal are filled in, if that's important to you:

import pandas as pd

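# Collect every object name that appears in any key pair.
unique_ids = sorted(set([x for y in obj_distances.keys() for x in y]))
df = pd.DataFrame(index=unique_ids, columns=unique_ids)

# Write each distance on both sides of the main diagonal.
for k, v in obj_distances.items():
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v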

Results:

      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
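
To go from a frame like this to scipy clustering, something along these lines might work. It is only a sketch, and it assumes the dictionary holds every pair (the six example values are taken from another answer on this page). As far as I know, `linkage` treats a 2-D input as a set of observation vectors rather than as a distance matrix, so the square matrix is converted to condensed form first:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Complete pairwise distances (assumed example data).
obj_distances = {
    ('obj1', 'obj2'): 2.0,
    ('obj1', 'obj3'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj2', 'obj3'): 1.8,
    ('obj2', 'obj4'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

unique_ids = sorted({x for pair in obj_distances for x in pair})
# A float frame avoids the object dtype of an empty DataFrame.
df = pd.DataFrame(index=unique_ids, columns=unique_ids, dtype=float)
for (a, b), v in obj_distances.items():
    df.loc[a, b] = v
    df.loc[b, a] = v

square = df.to_numpy(copy=True)
np.fill_diagonal(square, 0.0)  # distance from an object to itself

# Stop early if any pair was missing from the dictionary.
if np.isnan(square).any():
    raise ValueError("some pairwise distances are missing")

# squareform checks symmetry and the zero diagonal by default.
condensed = squareform(square)
Z = linkage(condensed)
```

`Z` can then be passed to `scipy.cluster.hierarchy.dendrogram` with `labels=unique_ids`.
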
sjw