I have the following code for similarity scoring:
from rapidfuzz import process, fuzz
import numpy as np
import pandas as pd
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
# Full pairwise similarity matrix: len(names) x len(names)
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
# Keep the index pairs whose similarity score exceeds 50
x, y = np.where(scores > 50)
# Collect each name's set of similar names, deduplicate the sets and number them
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
I want to identify similar names in the name column if those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2 and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
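For reference, the pairwise scores behind that example look roughly like this (a minimal sketch using fuzz.ratio, which process.cdist uses as its default scorer; exact values may vary slightly between rapidfuzz versions):
from rapidfuzz import fuzz
# These pairs score well above the 50 threshold used later
print(fuzz.ratio("South Beach", "Beach"))  # high, above 50
print(fuzz.ratio("Dog", "Big Dog"))        # high, above 50
print(fuzz.ratio("Dog", "Cat"))            # low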
The code generates the expected result:
name cluster_number id
0 South Beach 1 1
1 Dog 2 2
2 Bird 3 3
3 Ant 3 4
4 Big Dog 2 2
5 Beach 1 1
6 Dear 4 5
7 Cat 2 6
The code above is an efficient and vectorized method for similarity scoring. It works perfectly for small data sets, but when I try a dataframe with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comment section below, this function returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float). So for 1 million names, the result matrix would require about 4 terabytes of memory. My PC has 12 GB of free RAM, which is nowhere near enough. Any ideas how to avoid overloading RAM while keeping the computation vectorized?
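As a rough check of that figure (a minimal sketch, assuming 4 bytes per matrix cell for the float result):
# Size of the full len(queries) x len(choices) score matrix for 1 million names
n = 1_000_000
bytes_per_cell = 4
print(n * n * bytes_per_cell / 1e12)  # ~4.0 terabytes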
Following @J.M.Arnold's solution, including his comment, the code may be rewritten as:
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)
names = df_test["name"]
def calculate_similarity_matrix(names):
    scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
    return scores

# Score each chunk only against itself, then stack the per-chunk matrices
chunks = np.array_split(names, 1000)
chunk_scores = []
for chunk in chunks:
    matrix = calculate_similarity_matrix(chunk)
    chunk_scores.append(matrix)
finished = pd.concat(chunk_scores)
x, y = np.where(finished > 50)
groups = (pd.DataFrame(finished.index[x], finished.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
But it does not generate correct results:
name cluster_number id
0 Beach 1 2
1 South Beach 1 8
2 Big Dog 2 3
3 Cat 2 5
4 Dog 2 7
5 Ant 3 1
6 Bird 3 4
7 Dear 4 6
Note that, e.g., Dog and Big Dog have different id values, but they should have the same one.
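A minimal sketch of why the chunked version misses those matches, using the same test data (the variable non_empty is just for illustration):
import numpy as np
import pandas as pd

# Names in the order produced by sorting on cluster_number, then name
names = pd.Series(['Beach', 'South Beach', 'Big Dog', 'Cat', 'Dog', 'Ant', 'Bird', 'Dear'])

# np.array_split(names, 1000) on 8 names yields 992 empty chunks and 8
# single-name chunks, so calculate_similarity_matrix(chunk) only ever compares
# a name with itself; cross-chunk pairs such as 'Dog' / 'Big Dog' are never scored.
non_empty = [list(c) for c in np.array_split(names, 1000) if len(c) > 0]
print(non_empty)  # [['Beach'], ['South Beach'], ['Big Dog'], ['Cat'], ['Dog'], ['Ant'], ['Bird'], ['Dear']]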