The problem is that you are passing whole pandas Series to hamming_dist
rather than individual strings. One solution is to use itertools.product
to generate every ordered pair of strings:
import pandas as pd
from itertools import product
def hamming_dist(str1, str2):
    hamming = 0
    for letter in range(len(str1)):
        if str1[letter] != str2[letter]:
            hamming += 1
    return hamming
hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])
res = pd.DataFrame([hamming_dist(*p) for p in product(hamming_df["protein"], repeat=2)], columns=["hamming_protein"])
print(res)
Output
   hamming_protein
0                0
1                1
2                1
3                0
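To see which pair each row of that result corresponds to, the pairs can be kept next to their distances. This is only a minimal sketch: it uses a plain list as a stand-in for hamming_df["protein"], and the names proteins and labelled are illustrative, not part of the original code.

```python
from itertools import product

def hamming_dist(str1, str2):
    # Count the positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))

# Stand-in for hamming_df["protein"]; iterating a Series yields its values too.
proteins = ["hello", "yello"]

# product(..., repeat=2) yields every ordered pair, self-pairs included,
# with the first element varying slowest.
labelled = [(a, b, hamming_dist(a, b)) for a, b in product(proteins, repeat=2)]

for a, b, d in labelled:
    print(a, b, d)
```

The rows come out in the order (hello, hello), (hello, yello), (yello, hello), (yello, yello), which matches the four rows printed above.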
An alternative is to use scipy.spatial.distance.pdist
to compute the pairwise distances:
from scipy.spatial.distance import pdist, squareform
hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])
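# pdist passes each row to the metric as a 1-element array, so unwrap the string first
arr = squareform(pdist(hamming_df["protein"].to_numpy().reshape((-1, 1)), metric=lambda u, v: hamming_dist(u[0], v[0]))).flatten()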
res = pd.DataFrame(arr, columns=["hamming_protein"])
print(res)
Output
   hamming_protein
0              0.0
1              1.0
2              1.0
3              0.0
Note
I suggest you use the following hamming_dist function, which also works for strings of different lengths by counting each unmatched trailing character as one mismatch:
def hamming_dist(str1, str2):
    return sum(l1 != l2 for l1, l2 in zip(str1, str2)) + abs(len(str1) - len(str2))
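A quick check of how this version behaves may help. The loop-based version raises an IndexError when the second string is shorter than the first, and silently ignores the extra characters when it is longer. The sketch below uses made-up inputs, and the variable names are illustrative only.

```python
def hamming_dist(str1, str2):
    # Mismatches over the shared positions, plus one per unmatched trailing character.
    return sum(l1 != l2 for l1, l2 in zip(str1, str2)) + abs(len(str1) - len(str2))

same_length = hamming_dist("hello", "yello")  # one substitution
shorter = hamming_dist("hello", "hell")       # one missing character
both = hamming_dist("hello", "yell")          # one substitution and one missing character

print(same_length, shorter, both)
```

Whether a length difference should count toward the distance is a design choice; strictly speaking, the Hamming distance is only defined for strings of equal length.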
UPDATE
If the desired output is a distance matrix, I suggest you use pdist
as follows:
from scipy.spatial.distance import pdist, squareform
hamming_df = pd.DataFrame(["hello", "yello"], columns=["protein"])
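# pdist passes each row to the metric as a 1-element array, so unwrap the string first
arr = squareform(pdist(hamming_df["protein"].to_numpy().reshape((-1, 1)), metric=lambda u, v: hamming_dist(u[0], v[0])))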
res = pd.DataFrame(arr, columns=hamming_df["protein"], index=hamming_df["protein"])
print(res)
Output
protein  hello  yello
protein
hello      0.0    1.0
yello      1.0    0.0
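If SciPy is not available, a symmetric matrix of the same shape can be built with the standard library alone. This is a rough sketch, not a drop-in replacement for pdist: distance_matrix is a hypothetical helper, and the third string "jelly" is added purely for illustration.

```python
from itertools import combinations

def hamming_dist(str1, str2):
    # Mismatches over the shared positions, plus one per unmatched trailing character.
    return sum(l1 != l2 for l1, l2 in zip(str1, str2)) + abs(len(str1) - len(str2))

def distance_matrix(strings):
    # Hypothetical helper: compute each unordered pair once and mirror it,
    # leaving zeros on the diagonal.
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_dist(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix

proteins = ["hello", "yello", "jelly"]
matrix = distance_matrix(proteins)

for name, row in zip(proteins, matrix):
    print(name, row)
```

The nested list can then be wrapped as pd.DataFrame(matrix, index=proteins, columns=proteins) to get a labelled table like the one above, with integer rather than float entries.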