I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','b'],'Sym3':['a','c','b','d'],'Sym4':['b','b','b','a']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a

and I want to find an elegant way to get the distance between each pair of Items according to this distance matrix:

In [34]:
DistMatrix = pd.DataFrame({'a': [0,0,0.67,1.34],'b':[0,0,0,0.67],'c':[0.67,0,0,0],'d':[1.34,0.67,0,0]},index=['a','b','c','d'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00 

For example, comparing Item1 to Item2 means comparing aaab with accb position by position; using the distance matrix, this gives 0 + 0.67 + 0.67 + 0 = 1.34.

Ideal output:

       Item1  Item2  Item3  Item4
Item1      0   1.34      0   2.68
Item2   1.34      0      0   1.34
Item3      0      0      0   2.01
Item4   2.68   1.34   2.01      0
Clayton

3 Answers


This is an old question, but there is a SciPy function that computes pairwise distances:

from scipy.spatial.distance import pdist, squareform

distances = pdist(sample.values, metric='euclidean')
dist_matrix = squareform(distances)

pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument lets you select one of several built-in distance metrics, or you can pass in any function of two 1-D arrays to use a custom distance. Note that built-in metrics such as 'euclidean' expect numeric data, so symbolic data like the sample in the question must first be encoded as numbers and paired with a custom metric. It's very powerful and, in my experience, very fast. The result is a "flat" (condensed) array that holds only the upper triangle of the distance matrix (because it's symmetric), not including the diagonal (because it's always 0). squareform then expands this condensed form into a full matrix.
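To make the condensed form concrete, here is a minimal sketch on three made-up 2-D points (the points are an assumption chosen so the distances are easy to check by hand, not data from the question):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical points on a line through the origin (3-4-5 triangles).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per unordered pair, in the order (0,1), (0,2), (1,2).
condensed = pdist(points, metric='euclidean')

# Full form: symmetric 3x3 matrix with a zero diagonal.
full = squareform(condensed)

print(condensed)  # distances 5, 10 and 5
print(full)
```

For n rows the condensed array has n*(n-1)/2 entries, which is why it is the cheaper form to store when the full matrix is not needed.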

The docs have more info, including a mathematical rundown of the many built-in distance functions.
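For the symbolic data in the question, one possible way to use pdist is sketched below. It is only a sketch under stated assumptions: the symbols are encoded as integer positions in DistMatrix's index, and `lookup_distance` is a hypothetical helper name, not part of SciPy. It also relies on DistMatrix being symmetric, because pdist evaluates each pair only once.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Data from the question.
sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Encode each symbol as its position in DistMatrix's index (a=0, b=1, ...).
position = {symbol: i for i, symbol in enumerate(DistMatrix.index)}
codes = np.array([[position[s] for s in row] for row in sample.to_numpy()])

# Plain NumPy lookup table of symbol-to-symbol distances.
table = DistMatrix.to_numpy(dtype=float)

def lookup_distance(u, v):
    """Hypothetical custom metric: sum the per-position symbol distances."""
    # Cast defensively in case the rows arrive as floats.
    return table[u.astype(int), v.astype(int)].sum()

condensed = pdist(codes, metric=lookup_distance)
result = pd.DataFrame(squareform(condensed),
                      index=sample.index, columns=sample.index)
print(result)
```

A Python callable is invoked once per pair, so this will be noticeably slower than the built-in metrics on large inputs.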

shadowtalker
    Tip: plot the results easily with `from matplotlib import pyplot as plt; plt.imshow(dist_matrix, interpolation='none'); plt.colorbar(); plt.show()` – Niko Föhr Jan 25 '23 at 21:13
    @np8 great tip, I love "heatmap" plots of that nature. However the ordering of the entries is arbitrary and can significantly change the appearance of the plot. In cases where there's no natural ordering of the data (e.g. least to greatest), in the past I have used RPy2 to call routines in the R `seriation` package to automatically determine a useful ordering: https://cran.r-project.org/package=seriation. There is a native Python package for it as well, but I haven't tried it yet: https://pypi.org/project/seriate/. Note the use of `pdist` in the example given in the latter project's Readme! – shadowtalker Jan 26 '23 at 01:12

For large data, I found a fast way to do this. Assume your data is already a NumPy array named a.

from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(a, a)

Below is an experiment comparing the time needed by the two approaches:

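import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

a = np.random.rand(1000, 1000)

# SciPy: condensed distances, then expand to a square matrix
time1 = time.time()
distances = pdist(a, metric='euclidean')
dist_matrix = squareform(distances)
time2 = time.time()
time2 - time1  # 0.3639109134674072

# scikit-learn: full square matrix in one call
time1 = time.time()
dist = euclidean_distances(a, a)
time2 = time.time()
time2 - time1  # 0.08735871315002441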
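One likely reason for the difference is that the scikit-learn documentation describes euclidean_distances as using the expansion dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)), which turns most of the work into a single matrix product. The sketch below reproduces that idea in plain NumPy and checks it against pdist; it is an illustration of the formula on made-up random data, not scikit-learn's actual implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
a = rng.random((200, 50))  # hypothetical data, smaller than in the timing above

# Squared norms of every row, then the expanded squared-distance formula.
sq_norms = (a ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (a @ a.T)

# Rounding can leave tiny negative values; clip them before the square root.
np.maximum(sq_dists, 0.0, out=sq_dists)
expanded = np.sqrt(sq_dists)
np.fill_diagonal(expanded, 0.0)

# Reference result computed pair by pair.
reference = squareform(pdist(a, metric='euclidean'))
print(np.allclose(expanded, reference, atol=1e-6))
```

The expansion trades a little numerical precision for speed, so the two results agree only up to a small tolerance.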
Michelle Owen

This does twice as much work as needed, but it technically also works for non-symmetric distance matrices (whatever that is supposed to mean):

pd.DataFrame ( { idx1: { idx2:sum( DistMatrix[ x ][ y ]
                                  for (x, y) in zip( row1, row2 ) ) 
                         for (idx2, row2) in sample.iterrows( ) } 
                 for (idx1, row1 ) in sample.iterrows( ) } )

You can make it more readable by writing it in pieces:

# a helper function to compute distance of two items
dist = lambda xs, ys: sum( DistMatrix[ x ][ y ] for ( x, y ) in zip( xs, ys ) )

# a second helper function to compute distances from a given item
xdist = lambda x: { idx: dist( x, y ) for (idx, y) in sample.iterrows( ) }

# the pairwise distance matrix
pd.DataFrame( { idx: xdist( x ) for ( idx, x ) in sample.iterrows( ) } )
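If the Python-level loops become a bottleneck, the same lookup can be sketched with NumPy fancy indexing. This is an alternative under stated assumptions rather than the method above: symbols are encoded as integer positions in DistMatrix's index, and the name `position` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Data from the question.
sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Map each symbol to its row/column position in the distance table.
position = {symbol: i for i, symbol in enumerate(DistMatrix.index)}
codes = np.array([[position[s] for s in row] for row in sample.to_numpy()])

# DistMatrix is symmetric here, so row/column orientation does not matter.
table = DistMatrix.to_numpy(dtype=float)

# Broadcast to shape (n_items, n_items, n_symbols), look up every
# per-position distance at once, then sum over the symbol axis.
pairwise = table[codes[:, None, :], codes[None, :, :]].sum(axis=-1)

result = pd.DataFrame(pairwise, index=sample.index, columns=sample.index)
print(result)
```

The intermediate array has n_items * n_items * n_symbols entries, so this trades memory for speed and may need chunking for very large inputs.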
behzad.nouri