
I need your help to rewrite this so that I won't get a MemoryError.

I have two DataFrames containing laptops/PCs with their configurations.

DataFrame one called df_original:

processorName    GraphicsCardname    ProcessorBrand
5950x            Rtx 3060 ti         i7
3600             Rtx 3090            i7
1165g7           Rtx 3050            i5

DataFrame two df_compare:

processorName    GraphicsCardname    ProcessorBrand
5950x            Rtx 3090            i7
1165g7           Rtx 3060 ti         i7
1165g7           Rtx 3050            i5

What I would like to do is calculate how similar they are. By similar I mean: compare each value in a column to the value in the same column of the other DataFrame, for example comparing 5950x to 1165g7 (processorName). Each feature has a value (weight), for example processorName has a weight of 2.

So for each row of df_original I want to check whether it has the same config as a row in df_compare. If yes, do nothing; if not, add the column's value to a variable called weight. For example, if two rows are identical except for processorName, then the weight is going to be 2, because processorName has a value of 2.

This is what I am doing:

def calculate_values(orig_row, comp_row, columns):
    weight = 0
    for i, value in enumerate(orig_row):
        if comp_row[i] != value:
            weight += get_weight(columns[i])  # just looks up the column's weight, e.g. 2 if the processorName differs
    return weight

values = []
for i, orig_row in enumerate(df_original.values):
    values.append([])
    for comp_row in df_compare.values:
        values[i].append(calculate_values(orig_row, comp_row, columns))

The output for values would look something like values = [[2, 2, 6], [2, 4, 6], ...]

This output values = [[2,2,6],[2,4,6], ...] means that values[0] corresponds to the first row of df_original: values[0][0] is the weight from comparing the first row of df_original with the first row of df_compare, values[0][1] is the weight from comparing the first row of df_original with the second row of df_compare, and so on.

This triple nested for loop is very slow and gives me a MemoryError. I am working with around 200k rows in each DataFrame.

Would you mind helping me rewrite this in a faster way?

Thanks

  • FYI - do not use HTML tags in your table because we cannot copy the data using `pd.read_clipboard`. Also, make sure you provide an expected output, because you never defined `calculate_values` or `get_weight` in your question. – It_is_Chris Oct 19 '22 at 15:27
  • I am going to watch out for that next time, thanks – My name is jeff Oct 19 '22 at 15:30
  • Thanks for updating the question. I am a little confused about how the values are calculated. Are you just looking for `((df_original != df_compare) *2).sum(axis=1).values`? – It_is_Chris Oct 19 '22 at 16:05
  • I compare, for example, each processorName from df_original to the processorName of df_compare, and do the same with GraphicsCardName and so on. When they are not equal, meaning they are not the same string, I get their weight. The weights are in a config file, where I have something like processorName: 2, graphicsCardName: 2 and so on, so basically just how much value they have – My name is jeff Oct 19 '22 at 16:28
  • When you say you compare "each processorName from df_original to the processorName of df_compare" do you mean by the index or each item compared to every row - i.e., `5950x` will be compared against all the rows in `df_compare` or just row index 0 because that is its position in df_original? – It_is_Chris Oct 19 '22 at 16:32
  • yes, the processorName is only going to be compared with a processorName; in our example it's on index 0. So no, not against all rows, just against processorNames – My name is jeff Oct 19 '22 at 16:44
  • Is the output for these data frames 3 values or 9 - i.e., is df_original row index 0 `5950x` compared to every row in df_compare (row index 0, 1 and 2, `[5950x, 1165g7, 1165g7]`), or just df_original row index 0 `5950x` to df_compare row index 0 `5950x`? – It_is_Chris Oct 19 '22 at 16:49
  • it will help if you add the complete expected output based on your sample data and the weights for each column. – It_is_Chris Oct 19 '22 at 16:51
  • when I wrote about the output `arr=[[2,2,6],[2,4,6]...]` it means that `arr[0]` is the first row in df_original; `arr[0][0]` is the weight comparing the first row of df_original and the first row of df_compare, `arr[0][1]` is the weight for the first row of df_original and the second row of df_compare, and that's how it goes on – My name is jeff Oct 19 '22 at 16:55
  • Please don't add "solved" to your question title or body. See [what should I do when someone answers](https://stackoverflow.com/help/someone-answers) on how to show you've solved your problem. – Adriaan Oct 20 '22 at 09:02

2 Answers


Memory issue

Your desired output for such big DataFrames can't fit into regular PC memory.

With two DataFrames of ~200k rows you are calculating a 2D array (matrix) of shape 200k × 200k. That's 40 billion values, or 160 GB assuming 4 bytes per value.
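
For instance, a quick back-of-the-envelope check of that figure:

n_orig = 200_000        # rows in df_original
n_comp = 200_000        # rows in df_compare
bytes_per_value = 4     # e.g. int32 weights

print(n_orig * n_comp * bytes_per_value / 1e9)  # -> 160.0 (GB)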

I would suggest you reconsider whether you need the whole similarity matrix. The structure you might want to replace it with depends on what you want to do with it:

  • Do you want to find devices with similarity higher than some threshold x? Store their index pairs in a list.
  • Do you want to find the N most similar devices? You can use some kind of priority queue ordered by similarity score and drop entries with lower similarity during the calculation (a minimal sketch follows this list).
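
For instance, a minimal sketch of the second idea using Python's heapq, keeping only the N best (lowest-weight) matches per row of df_original instead of the full 200k-wide row of the matrix. It assumes you already have the vector of weights of one row against every row of df_compare (as computed in the loop further down):

import heapq

def top_n_matches(row_weights, n=5):
    # row_weights: 1-D array of weights of one df_original row against all df_compare rows
    # lower weight = more similar, so keep the n smallest (compare_index, weight) pairs
    return heapq.nsmallest(n, enumerate(row_weights), key=lambda pair: pair[1])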

Speed issue

Comparing strings is expensive because every single character may have to be compared. Since the only operation performed on the data is an equality check, the strings can either be hashed:

pd.util.hash_array(df.to_numpy().flatten()).reshape(df.shape)

or converted into category type:

df.astype('category')

However, this improvement alone won't make enough of a difference.

As stated in this answer, the fastest way to iterate over rows is df.itertuples. But again, your dataset is too large, and merely iterating without any computation would already take more than 5 hours. Python for loops are unfit for processing data at this scale.
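
For reference, the itertuples version of your nested loop would look roughly like this - still quadratic, so shown only for comparison (it reuses calculate_values and columns from your question):

values = []
for orig_row in df_original.itertuples(index=False):
    row_values = []
    for comp_row in df_compare.itertuples(index=False):
        row_values.append(calculate_values(orig_row, comp_row, columns))
    values.append(row_values)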

A considerably faster approach is to use the vectorized operations provided by pandas and numpy. This would be one way to do it, using hashing and numpy:

import pandas as pd
import numpy as np
from tqdm import tqdm

def df_to_hashed_array(df):
    # hash every cell so rows can be compared as fixed-size integers instead of strings
    return pd.util.hash_array(df.to_numpy().flatten()).reshape(df.shape)

original_array = df_to_hashed_array(df_original)
compare_array = df_to_hashed_array(df_compare)

weights = np.array([1, 2, 3])  # vector of weights, one per column, instead of a function

values = []

for i, row in enumerate(tqdm(original_array)):  # for each row in original
    # Compare row to the whole compare_array.
    # Output is True where a value of compare_array differs from row.
    diff_values_array = compare_array != row

    # Multiply by the weights
    weighted_diff = diff_values_array * weights

    # Sum the weights of the differing columns for each row of compare_array.
    # The output is a vector with one weight per row of df_compare.
    row_weights = weighted_diff.sum(axis=1)

    # EXAMPLE: find rows with weight less than 1, i.e. identical rows
    filtered_indices = np.where(row_weights < 1)[0]
    values.extend((i, j) for j in filtered_indices)
The output here is a list of tuples (df_original index, df_compare index) that satisfy the condition. The DataFrames are required to have an index running from 0 to N so that these pairs can be mapped back to their original rows.
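
Mapping the resulting pairs back to the original rows could then look like this (a small sketch, assuming both DataFrames were re-indexed with reset_index(drop=True)):

df_original = df_original.reset_index(drop=True)
df_compare = df_compare.reset_index(drop=True)

for orig_idx, comp_idx in values:
    orig_row = df_original.loc[orig_idx]
    comp_row = df_compare.loc[comp_idx]
    # ... do something with the matched rows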

This code still runs for ~20 minutes. I don't think it can get much better for a problem with quadratic complexity and this much data.

jrudolf
  • Hey, thank you very much for this. With your help and the other answer I could rewrite the for loop and get a much quicker run time. – My name is jeff Oct 20 '22 at 08:21

IIUC, try using a list comprehension - note that this assumes every column has a weight of 2:

lst = [list(((val != df_compare)*2).sum(axis=1)) for val in df_original.values]

# -> [[2, 2, 6], [2, 4, 6], [6, 4, 0]]

If you have custom weights try

weights = {'processorName': {True: 1},
           'GraphicsCardname': {True: 2},
           'ProcessorBrand': {True: 3}}

lst = [list((val != df_compare).replace(weights).astype(int).sum(axis=1)) 
       for val in df_original.values]

# -> [[2, 1, 6], [1, 3, 6], [6, 5, 0]]
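
If your config only stores a flat column-to-weight mapping (for example {'processorName': 2, 'GraphicsCardname': 4} - hypothetical values), one way to build the nested dict that `replace` expects would be:

# flat column -> weight mapping, e.g. loaded from your config file (example values)
flat_weights = {'processorName': 2, 'GraphicsCardname': 4, 'ProcessorBrand': 2}

# wrap each weight in {True: weight}: `val != df_compare` yields booleans,
# replace() turns True into the column's weight, and False stays 0 after astype(int)
weights = {col: {True: w} for col, w in flat_weights.items()}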
It_is_Chris
  • Thanks for the answer, one question: my weights are stored like in your `weights` dict but without the `{True: 1}` part - why is that necessary? I only have something like processorName: 2, graphicsCardName: 4, and I am clearly not getting good results. How come? – My name is jeff Oct 20 '22 at 07:41
  • If I were using this exact example but with numerical features instead of categorical ones, would this list comprehension be just as good to use, or is it a whole different story with numerical data? – My name is jeff Oct 20 '22 at 08:23
  • @Mynameisjeff You need the `{True: 1}` part because `val != df_compare` returns a boolean dataframe (True, False), and True, when converted to an int, is equal to 1. So we need to provide the column name (processorName), what we want to replace in the column (True), and the value to replace it with (1). It also should not matter whether your data is categorical or numeric. – It_is_Chris Oct 20 '22 at 13:18