I want to create a new DataFrame that is a distance matrix of the Hamming distances between all values of a specific column in an existing DataFrame. The Hamming distance function itself seems to work fine:

def hamming_dist(str1, str2):
    hamming = 0
    for letter in range(len(str1)):
        if str1[letter] != str2[letter]:
            hamming += 1
    return hamming

hamming_dist("hello", "heLLo")

Output

2

I want to compute the Hamming distance between all pairs of values in the "protein" column of a DataFrame called hamming_df.

hamming_distance_df = pd.DataFrame(hamming_dist(hamming_df["protein"], hamming_df["protein"]))\
    (index = hamming_df["protein"], columns=hamming_df["protein"])

The output is a DataFrame with the correct index and columns, but all the values are 0 rather than the actual Hamming distances. Any ideas?

Thanks

abc_123

1 Answer

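The problem is that you are passing pandas Series to hamming_dist, not strings, so the function compares the column with itself element by element and returns a single 0 instead of one distance per pair. One solution is to use itertools.product to generate every pair of strings: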

import pandas as pd
from itertools import product


def hamming_dist(str1, str2):
    hamming = 0
    for letter in range(len(str1)):
        if str1[letter] != str2[letter]:
            hamming += 1
    return hamming


hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])

res = pd.DataFrame([hamming_dist(*p) for p in product(hamming_df["protein"], repeat=2)], columns=["hamming_protein"])
print(res)

Output

   hamming_protein
0                0
1                1
2                1
3                0
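The flat list above can also be reshaped into the square matrix the question asks for. The following is a minimal sketch, not a definitive implementation: the names proteins, flat and matrix and the third sample string "heLLo" are assumptions made for illustration, and the length-tolerant hamming_dist is used so the block is self-contained.

```python
import numpy as np
import pandas as pd
from itertools import product


def hamming_dist(str1, str2):
    # Count mismatched positions; extra trailing characters count as mismatches.
    return sum(c1 != c2 for c1, c2 in zip(str1, str2)) + abs(len(str1) - len(str2))


# Illustrative data (assumed); replace with hamming_df["protein"].
proteins = pd.Series(["hello", "yello", "heLLo"], name="protein")
n = len(proteins)

# product(..., repeat=2) yields pairs in row-major order, so reshaping the
# flat list to (n, n) puts dist(proteins[i], proteins[j]) at position [i, j].
flat = [hamming_dist(a, b) for a, b in product(proteins, repeat=2)]
matrix = pd.DataFrame(np.array(flat).reshape(n, n), index=proteins, columns=proteins)
print(matrix)
```

This computes every pair twice (the matrix is symmetric), which is acceptable for small columns but wasteful for large ones.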

An alternative is to use scipy.spatial.distance.pdist to compute the distances:

from scipy.spatial.distance import pdist, squareform

hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])
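# pdist passes each row to the metric as a 1-element array, so unpack the string with [0]
arr = squareform(pdist(hamming_df["protein"].to_numpy().reshape((-1, 1)), metric=lambda u, v: hamming_dist(u[0], v[0]))).flatten()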
res = pd.DataFrame(arr, columns=["hamming_protein"])
print(res)

Output

   hamming_protein
0              0.0
1              1.0
2              1.0
3              0.0

Note

I suggest you use the following hamming_dist function, which also works for strings of different lengths by counting each extra trailing character as a mismatch:

def hamming_dist(str1, str2):
    return sum(l1 != l2 for l1, l2 in zip(str1, str2)) + abs(len(str1) - len(str2))
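As a quick check of the function above, here is a small sketch with a few sample calls. The input strings are assumptions chosen for illustration, and the function is repeated so the block runs on its own.

```python
def hamming_dist(str1, str2):
    # Mismatches over the shared prefix length, plus the length difference.
    return sum(l1 != l2 for l1, l2 in zip(str1, str2)) + abs(len(str1) - len(str2))


same_length = hamming_dist("hello", "heLLo")   # two positions differ
shorter = hamming_dist("hello", "hell")        # one missing trailing character
empty = hamming_dist("", "abc")                # all three characters are extra

print(same_length, shorter, empty)
```

Note that this treats a length difference as trailing mismatches only; it does not try to align the strings the way an edit distance would.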

UPDATE

If the desired output is a distance matrix, I suggest you use pdist as follows:

from scipy.spatial.distance import pdist, squareform

hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])
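# pdist passes each row to the metric as a 1-element array, so unpack the string with [0]
arr = squareform(pdist(hamming_df["protein"].to_numpy().reshape((-1, 1)), metric=lambda u, v: hamming_dist(u[0], v[0])))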
res = pd.DataFrame(arr, columns=hamming_df["protein"], index=hamming_df["protein"])
print(res)

Output

protein  hello  yello
protein              
hello      0.0    1.0
yello      1.0    0.0
Dani Mesejo