I have two .csv files. One, with new data, has ~100 rows; the other is a reference book with ~40k rows.
I want to compare each string from the first file against all strings from the second and find the Levenshtein distance to the most similar string.
After that I need to create a third file with the results (all data from the first file, the score of the best match, and the matching string from the second file).
For example:
File A (new data):
Spam
Foo
File B (reference book):
Bar 1 0
Spamm 2 1
Spann 3 0
Booo 1 0
Fooo 2 2
Bo 3 3
...
What I need (result file), where n is the Levenshtein distance:
Spam n Spamm
Foo n Fooo
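As a sanity check on the expected n values: "Spam" to "Spamm" is one insertion, and "Foo" to "Fooo" is also one insertion, so n = 1 for both rows above. Assuming the python-Levenshtein package provides the distance function:

from Levenshtein import distance
print(distance("Spam", "Spamm"))  # 1 (one insertion)
print(distance("Foo", "Fooo"))    # 1 (one insertion)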
My current code is:
import csv
from Levenshtein import distance  # assuming the python-Levenshtein package

def calculate_Leven(source, ref, result_file):
    with open(source, 'r') as input1, \
         open(ref, 'r') as input2, \
         open(result_file, 'w', newline='') as csvoutput:
        reader1 = csv.reader(input1)
        reader2 = list(csv.reader(input2))  # materialize: it is rescanned for every new row
        writer = csv.writer(csvoutput)
        result = []
        headers = next(reader1)
        result.append(headers)
        for row1 in reader1:
            best_score = 0.0   # renamed from max, which shadows the built-in
            best_match = None
            for row2 in reader2:
                a = distance(row1[0], row2[0])
                b = 1 - a / len(row1[0])  # normalized similarity, not the raw distance
                if b > best_score:
                    best_score = b
                    best_match = row2[0]
            row1.append(best_score)
            row1.append(best_match)
            result.append(row1)
        writer.writerows(result)
Here distance is any function that calculates the Levenshtein distance; the import above assumes the python-Levenshtein package.
This code works, but it is extremely slow: each of the ~100 new rows is compared against all ~40k reference rows in pure Python. Is there a better way to structure this, or a more efficient alternative? I have about 100 new files per day to check against the same reference book, so the low speed is a real bottleneck.
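Would something along these lines be the right direction? A minimal sketch, assuming the rapidfuzz library (its process.extractOne scans the whole candidate list in C, and with a distance scorer it returns the match with the minimum distance):

import csv
from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

def calculate_leven_fast(source, ref, result_file):
    # load the ~40k reference strings once; they are reused for every new row
    with open(ref, newline='') as f:
        choices = [row[0] for row in csv.reader(f)]
    with open(source, newline='') as f_in, open(result_file, 'w', newline='') as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(next(reader))  # copy the header row over
        for row in reader:
            # extractOne returns (best_match, score, index); with a distance
            # scorer the best score is the minimum distance over all choices
            match, dist, _ = process.extractOne(row[0], choices, scorer=Levenshtein.distance)
            writer.writerow(row + [dist, match])

Since the reference book is the same for all daily files, the choices list could also be loaded once and shared across all ~100 files instead of being re-read each time.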