I am doing fuzzy similarity matching between all rows in the 'name' column using Python and PySpark in a Jupyter notebook. The expected output is a new column containing the similar strings, together with another new column holding the score for each of those strings. My question is quite similar to this question, except that that question is in R and uses two datasets (mine has only one). As I'm quite new to Python, I'm confused about how to do it. I have also tried some simple code with a similar function, but I'm not sure how to run it on the dataframe.
Here is the code:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Populate the first column and first row with the prefix lengths
    # (separate loops, so this also works when one string is empty)
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,       # Cost of deletions
                                     distance[row][col-1] + 1,       # Cost of insertions
                                     distance[row-1][col-1] + cost)  # Cost of substitutions

    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio
        # (two empty strings are treated as identical to avoid dividing by zero)
        if len(s) + len(t) == 0:
            return 1.0
        Ratio = ((len(s) + len(t)) - distance[-1][-1]) / (len(s) + len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[-1][-1])

# Example with two simple strings
Str1 = "Apple Inc."
Str2 = "Jo Inc"
Distance = levenshtein_ratio_and_distance(Str1,Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
print(Ratio)
However, the code above only works on a pair of strings. What if I want to use the dataframe as the input instead of individual strings? For example, the input data is (say the dataset is named customer):
    name
1   Ace Co
2   Ace Co.
11  Baes
4   Bayes Inc.
8   Bayes
12  Bays
10  Bcy
15  asd
13  asd
The expected outcome is:
name    b_name                          dist
Ace Co  Ace Co.                         0.64762
Baes    Bayes Inc., Bayes, Bays, Bcy    0.80000, 0.86667, 0.70000, 0.97778
asd     asdf                            0.08333
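
To make the goal concrete, here is a minimal sketch in plain Python (no Spark) of the pairing I have in mind: every distinct name is compared with every other distinct name, and pairs whose ratio reaches a cutoff are kept. The helper names `levenshtein_ratio` and `similar_names` and the `threshold=0.7` cutoff are my own assumptions for illustration. The scores come from the Levenshtein ratio with substitution cost 2, so they will not necessarily match the numbers in the expected outcome above.

```python
from itertools import permutations


def levenshtein_ratio(s, t):
    """Similarity in [0, 1]; a substitution costs 2, as in the ratio mode above."""
    if not s and not t:
        return 1.0  # two empty strings are treated as identical
    # prev[j] is the distance between the already-processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    total = len(s) + len(t)
    return (total - prev[-1]) / total


def similar_names(names, threshold=0.7):
    """Map each distinct name to (other_name, score) pairs with score >= threshold.

    `threshold` is a hypothetical cutoff chosen only for this illustration.
    """
    unique = list(dict.fromkeys(names))  # drop exact duplicates, keep order
    matches = {name: [] for name in unique}
    for a, b in permutations(unique, 2):
        score = levenshtein_ratio(a, b)
        if score >= threshold:
            matches[a].append((b, round(score, 5)))
    return matches


names = ["Ace Co", "Ace Co.", "Baes", "Bayes Inc.", "Bayes",
         "Bays", "Bcy", "asd", "asd"]
for name, found in similar_names(names).items():
    if found:
        print(name, found)
```

This compares all pairs in memory, so it only suits small inputs. On a PySpark dataframe the same pairing could presumably be expressed as a self `crossJoin` combined with the built-in `pyspark.sql.functions.levenshtein` column function, but note that the built-in returns the plain edit distance (substitution cost 1), not this ratio.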