I have the following code for similarity scoring:
from rapidfuzz import process, fuzz
import numpy as np
import pandas as pd
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
# Full pairwise similarity matrix: len(names) x len(names)
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
# Keep the index pairs whose similarity score exceeds 50
x, y = np.where(scores > 50)
# Collect each name's set of similar names, deduplicate the sets and number them
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
I want to identify similar names in the name column if those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2 and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
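For reference, the pairwise scores behind that example look roughly like this (a minimal sketch using fuzz.ratio, which process.cdist uses as its default scorer; exact values may vary slightly between rapidfuzz versions):
from rapidfuzz import fuzz
# These pairs score well above the 50 threshold used later
print(fuzz.ratio("South Beach", "Beach"))  # high, above 50
print(fuzz.ratio("Dog", "Big Dog"))        # high, above 50
print(fuzz.ratio("Dog", "Cat"))            # low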
The code generates the expected result:
name cluster_number id
0 South Beach 1 1
1 Dog 2 2
2 Bird 3 3
3 Ant 3 4
4 Big Dog 2 2
5 Beach 1 1
6 Dear 4 5
7 Cat 2 6
The code above is an efficient and vectorized method for similarity scoring. It works perfectly for small data sets, but when I try a dataframe with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comment section below, this function returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float). So for 1 million names, the result matrix would require about 4 terabytes of memory. My PC has 12 GB of free RAM, which is nowhere near enough. Any ideas how to avoid overloading RAM while keeping the computation vectorized?
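As a rough check of that figure (a minimal sketch, assuming 4 bytes per matrix cell for the float result):
# Size of the full len(queries) x len(choices) score matrix for 1 million names
n = 1_000_000
bytes_per_cell = 4
print(n * n * bytes_per_cell / 1e12)  # ~4.0 terabytes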
Following @J.M.Arnold's solution, including his comment, the code may be rewritten as:
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)
names = df_test["name"]
def calculate_similarity_matrix(names):
    scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
    return scores

# Score each chunk only against itself, then stack the per-chunk matrices
chunks = np.array_split(names, 1000)
chunk_scores = []
for chunk in chunks:
    matrix = calculate_similarity_matrix(chunk)
    chunk_scores.append(matrix)
finished = pd.concat(chunk_scores)
x, y = np.where(finished > 50)
groups = (pd.DataFrame(finished.index[x], finished.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
But it does not generate correct results:
name cluster_number id
0 Beach 1 2
1 South Beach 1 8
2 Big Dog 2 3
3 Cat 2 5
4 Dog 2 7
5 Ant 3 1
6 Bird 3 4
7 Dear 4 6
Note that, e.g., Dog and Big Dog have different id values, but they should have the same one.
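A minimal sketch of why the chunked version misses those matches, using the same test data (the variable non_empty is just for illustration):
import numpy as np
import pandas as pd

# Names in the order produced by sorting on cluster_number, then name
names = pd.Series(['Beach', 'South Beach', 'Big Dog', 'Cat', 'Dog', 'Ant', 'Bird', 'Dear'])

# np.array_split(names, 1000) on 8 names yields 992 empty chunks and 8
# single-name chunks, so calculate_similarity_matrix(chunk) only ever compares
# a name with itself; cross-chunk pairs such as 'Dog' / 'Big Dog' are never scored.
non_empty = [list(c) for c in np.array_split(names, 1000) if len(c) > 0]
print(non_empty)  # [['Beach'], ['South Beach'], ['Big Dog'], ['Cat'], ['Dog'], ['Ant'], ['Bird'], ['Dear']]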