I have a program that reads a spreadsheet of properties into one DataFrame, queries a SQL database to build another DataFrame, and then runs a cosine similarity function against the two to tell me which addresses in the spreadsheet are in my database.

The code for my cosine similarity function is below, along with some helper functions. The problem is that, on a sheet of hundreds or thousands of addresses, it is very slow because it uses a nested for loop to build the list of best similarities for every address.

import string
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def clean_address(text):
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    return text

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)
  
def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
    
def getCosineSimilarities(internalDataframe, externalDataframe):
    similarities = []
    internalAddressColumn = internalDataframe['Address']
    internalPostcodeColumn = internalDataframe['postcode']
    externalAddressColumn = externalDataframe['full address']
    externalPostcodeColumn = externalDataframe['postcode']

    for i in range(len(internalDataframe)):
        bestSimilarity = 0
        for j in range(len(externalDataframe)):
            if internalPostcodeColumn.iloc[i].rstrip() == externalPostcodeColumn.iloc[j]:
                vector1 = text_to_vector(clean_address(internalAddressColumn.iloc[i]))
                vector2 = text_to_vector(clean_address(externalAddressColumn.iloc[j]))
                cosine = get_cosine(vector1, vector2)
                if cosine > bestSimilarity:
                    bestSimilarity = cosine
        similarities.append(bestSimilarity)
    
    return similarities

I'm sure it must be possible to create the list "similarities", returned by getCosineSimilarities, using a list comprehension or something similar, but I can't work out the best way to do it.

Please can someone help?

Edit: internalDataframe.head(5)

     Name               postcode    Created
0    Mr Joe Bloggs      SW6 6RD     2020-10-21 14:15:58.140
1    Mrs Joanne Bloggs  SE17 1LN    2013-06-27 14:52:29.417
2    Mr John Doe        SW17 0LN    2017-02-23 16:22:03.630
3    Mrs Joanne Doe     SW6 7JX     2019-07-03 14:52:00.773
4    Mr Joe Public      W5 2RX      2012-11-19 10:28:47.863

externalDataframe.head(5)

address_id  category beds postcode 
1005214     FLA      2    NW5 4DA  
1009390     FLA      2    NW5 1PB  
1053948     FLA      2    NW6 3SJ  
1075629     FLA      2    NW6 7UP
1084325     FLA      2    NW6 7YQ 
Anders
    It's hard to give a complete answer without some example dataframes, but I would suggest using `pd.merge` to join your `internalDataframe` and `externalDataframe`. Then you can use `groupby` on the `postcode` column to subset. There will still be loops hidden inside the `merge` function, but numpy and pandas are optimized for these operations. – SNygard Jun 10 '21 at 15:06
  • Could you please show head(5) or sample(5) from your data frames? – aerijman Jun 10 '21 at 15:08
  • @aerijman it hasn't printed everything - I've removed the ellipses - but above is roughly what .head(5) prints – Anders Jun 10 '21 at 15:20

3 Answers


As you said, the issue here is the nested loop. You are performing quite a few costly operations on `externalDataframe` for each item in `internalDataframe`:

  • `text_to_vector` involves a regex `findall` and a `Counter` creation. You can memoize the values from `externalDataframe` and modify your function accordingly
  • `get_cosine` involves a cast to `set` and a sum of powers of all items in a counter. Again, you can memoize the values from `externalDataframe` and modify your function accordingly. In this case you may want to memoize the results from `internalDataframe` too (see the sketch after this list)
  • less important, `for x in list(vec1.keys())` is redundant: you force the translation of a `dict_keys` into a `list` (one iteration) and then iterate over that list (another iteration). Just do `for x in vec1.keys()`
  • even less important, you may check whether one of `sum1` or `sum2` is zero before calculating the product of their square roots, instead of checking whether that product is zero
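
Here is a minimal sketch of how that memoization could look (the names `vector_and_norm` and `get_cosine_cached` are illustrative, not from the question), assuming the question's `clean_address` and `text_to_vector` helpers are in scope. A `Counter` is not hashable, so the cache key is the raw address text, and the cached value bundles the vector with its precomputed norm:

import math
from functools import lru_cache

@lru_cache(maxsize=None)
def vector_and_norm(text):
    # Computed once per distinct address string, then served from the cache
    vec = text_to_vector(clean_address(text))
    norm = math.sqrt(sum(v ** 2 for v in vec.values()))
    return vec, norm

def get_cosine_cached(text1, text2):
    vec1, norm1 = vector_and_norm(text1)
    vec2, norm2 = vector_and_norm(text2)
    # Check the factors rather than their product
    if norm1 == 0 or norm2 == 0:
        return 0.0
    numerator = sum(vec1[x] * vec2[x] for x in vec1.keys() & vec2.keys())
    return numerator / (norm1 * norm2)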
gimix

It seems that you need something like a distance matrix. Based on this SO answer, this is a sketch of how you can compare all string pairs from two dataframes' columns:

import pandas as pd
import numpy as np
from collections import Counter
import math

def text2vec(text):
    # just a naive transformation
    return Counter(text.split())

def get_cosine(text1, text2):
    """Modified version of your function – you might want to improve 
       it some more following gimix's advice or even better, make 
       full use of numpy arrays"""
    vec1, vec2 = text2vec(text1), text2vec(text2)
    
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
    
# Makes your function vector-ready
cos_sim = np.vectorize(get_cosine)

# Some pseudo data
data = {"address":["An address in some city", 
                   "Cool location in some town", 
                   "100 places to see before you die"]}
data2 = {"address":["Disney world", 
                    "An address in some city", 
                    "500 places to see before you die", 
                    "Neat location in some town"]}

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Compare all combinations and combine to a new dataframe
# This is 1:1 adopted from the answer linked above
cos_matrix = cos_sim(df.address.values, df2.address.values[:, None])
result_df = pd.concat((df2.address, 
                       pd.DataFrame(cos_matrix, 
                                    columns=df.address)), 
                       axis=1)

print(result_df)

This gives you all the pairwise values, and you can pick the best ones with max (see the short snippet after the table):

                            address  An address in some...   Cool location in...  100 places to see...
0                      Disney world                    0.0                   0.0              0.000000
1           An address in some city                    1.0                   0.4              0.000000
2  500 places to see before you die                    0.0                   0.0              0.857143
3        Neat location in some town                    0.4                   0.8              0.000000
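
For example, here is one way to pull out the best score per internal address and the external address that achieves it, building on the result_df above:

# Index by the external addresses so each column is one internal address
scores = result_df.set_index('address')
best_score = scores.max(axis=0)     # best similarity per internal address
best_match = scores.idxmax(axis=0)  # external address achieving that score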

fsimonjetz

@SNygard deserves to be credited here because his comment led me in the right direction (someone else will have to say whether the other two answers would also help - I got motoring with this approach and didn't look back).

I made a column in the internalDataframe to keep the index as a usable value, created a dictionary to store the best similarity for each index (starting with all 0s), and then merged the two DataFrames as advised. This meant I only needed to loop through the merged DataFrame once, updating the similarity dictionary where relevant.

It reduced the time to process the similarities from ~15 seconds to under 0.5 seconds for a 500-address externalDataframe, and I've also run it against a 6,000-address externalDataframe in 4.5 seconds - I can't compare that to anything because the last version took literally hours to process!

import pandas as pd

def getCosineSimilarities(internalDataframe, externalDataframe):

    # Keep the original index available as a normal column so it survives the merge
    internalDataframe['index'] = internalDataframe.index
    combinedDf = pd.merge(internalDataframe, externalDataframe, on='postcode')

    # Start every address at 0 so rows with no postcode match still get a score
    similarities_dict = {index: 0 for index in internalDataframe['index']}

    # Single pass: each merged row is one internal/external pair sharing a postcode
    for i in range(len(combinedDf)):
        vector1 = text_to_vector(clean_address(combinedDf['Address'].iloc[i]))
        vector2 = text_to_vector(clean_address(combinedDf['full address'].iloc[i]))
        cosine = get_cosine(vector1, vector2)
        index = combinedDf['index'].iloc[i]
        if cosine > similarities_dict[index]:
            similarities_dict[index] = cosine

    # Dicts preserve insertion order, so this matches the original row order
    return list(similarities_dict.values())
Anders