
I have a df as follows:

0    111155555511111116666611111111
1    555555111111111116666611222222
2    221111114444411111111777777777
3    111111116666666661111111111111
.......
1000  114444111111111111555555111111

I am calculating the distance between each pair of strings. For instance, to get the distance between the first two strings: textdistance.hamming(df[0], df[1]). This returns a single integer.

Now I want to create a df that stores the distances between all pairs of strings. Since I have 1000 strings, this will be a 1000 by 1000 df. The first value in the first row is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.

Has QUIT--Anony-Mousse
Sakib Shahriar

1 Answer


Create all ordered pairs of values of the Series and collect their Hamming distances in a list, then convert the list to an array and reshape it for the DataFrame:

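import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is assumed to be a Series of equal-length strings
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# store the square matrix under a new name so the input df stays intact
dist = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(dist)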
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0

EDIT:

To improve performance, use this solution with a changed lambda function:

import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# prepare a 2-dimensional array of shape M x 1 (M strings, one column)
transformed_strings = np.array(df).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))

# get the square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
jezrael
  • This solution looks good, but my Jupyter notebook has been running for more than 1 hour without finishing. Maybe itertools.product() is not suitable for larger computations. In my case, I have 2000 rows. – Sakib Shahriar Sep 09 '19 at 06:30
  • [`pdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist) has built-in support for hamming distance. I think you can just call it with `metric="hamming"` for better performance. – GZ0 Sep 17 '19 at 00:24
  • @GZ0 - I tested with `print (pdist(transformed_strings, metric="hamming"))` and also `print (pdist(pd.concat([df, df], axis=1).values, metric="hamming"))` and both return `[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]`... – jezrael Sep 17 '19 at 07:25
  • @jezrael The distance function acts on arrays / lists. If two strings are passed to it, they will just be treated as two length-1 lists. Each string needs to be converted to an array / list first. – GZ0 Sep 17 '19 at 14:23
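
The suggestion in the last comment could be sketched roughly as follows. This is a minimal illustration and not code from the thread: the sample Series `s` and the names `codes`, `fractions` and `dist` are hypothetical, and it assumes that all strings have the same length. Each string is expanded into a row of integer character codes, so that `pdist` compares the strings position by position. Because the built-in `"hamming"` metric returns the fraction of differing positions, the result is multiplied by the string length to recover integer counts.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical sample data: a Series of equal-length strings.
s = pd.Series(["11115", "55115", "22111", "11115"])

# One row per string, one column per character position (integer code points).
codes = np.array([[ord(c) for c in text] for text in s])

# The built-in "hamming" metric yields the fraction of differing positions
# for each unordered pair of rows (condensed form).
fractions = pdist(codes, metric="hamming")

# Expand to a square matrix and scale by the string length to get counts.
counts = np.rint(squareform(fractions) * codes.shape[1]).astype(int)

dist = pd.DataFrame(counts, index=s.index, columns=s.index)
print(dist)
```

Since the comparison then runs inside SciPy rather than in a Python-level lambda, this should scale better to a few thousand rows, although that has not been measured here.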