I have a programme that reads a spreadsheet of properties into one DataFrame, then queries a SQL database and makes another DataFrame, and then runs a cosine similarity function against the two to tell me which addresses in the spreadsheet are on my database.
The code for my cosine similarity function is below, along with some helper functions. I have the problem that, on a sheet of hundreds or thousands of addresses, it is very slow because it uses a nested for loop to create a list of the best similarities for every address.
import string
import math
import re
from collections import Counter
WORD = re.compile(r"\w+")
def clean_address(text):
text = ''.join([word for word in text if word not in string.punctuation])
text = text.lower()
return text
def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)
def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
else:
return float(numerator) / denominator
def getCosineSimilarities(internalDataframe, externalDataframe):
similarities = []
internalAddressColumn = internalDataframe['Address']
internalPostcodeColumn = internalDataframe['postcode']
externalAddressColumn = externalDataframe['full address']
externalPostcodeColumn = externalDataframe['postcode']
for i in range(len(internalDataframe)):
bestSimilarity = 0
for j in range(len(externalDataframe)):
if internalPostcodeColumn.iloc[i].rstrip() == externalPostcodeColumn.iloc[j]:
vector1 = text_to_vector(clean_address(internalAddressColumn.iloc[i]))
vector2 = text_to_vector(clean_address(externalAddressColumn.iloc[j]))
cosine = get_cosine(vector1, vector2)
if cosine > bestSimilarity:
bestSimilarity = cosine
similarities.append(bestSimilarity)
return similarities
I'm sure it must be possible to create the list "similarities", returned by getCosineSimilarities, using a list comprehension or something similar, but I can't work out the best way to do it.
Please can someone help?
Edit: internalDataframe.head(5)
Name postcode Created
0 Mr Joe Bloggs SW6 6RD 2020-10-21 14:15:58.140
1 Mrs Joanne Bloggs SE17 1LN 2013-06-27 14:52:29.417
2 Mr John Doe SW17 0LN 2017-02-23 16:22:03.630
3 Mrs Joanne Doe SW6 7JX 2019-07-03 14:52:00.773
4 Mr Joe Public W5 2RX 2012-11-19 10:28:47.863
externalDataframe.head(5)
address_id category beds postcode
1005214 FLA 2 NW5 4DA
1009390 FLA 2 NW5 1PB
1053948 FLA 2 NW6 3SJ
1075629 FLA 2 NW6 7UP
1084325 FLA 2 NW6 7YQ