I need your help rewriting this so that I won't get a MemoryError.
I have two DataFrames containing laptops/PCs with their configurations.
DataFrame one called df_original:
processorName | GraphicsCardname | ProcessorBrand |
---|---|---|
5950x | Rtx 3060 ti | i7 |
3600 | Rtx 3090 | i7 |
1165g7 | Rtx 3050 | i5 |
DataFrame two df_compare:
processorName | GraphicsCardname | ProcessorBrand |
---|---|---|
5950x | Rtx 3090 | i7 |
1165g7 | Rtx 3060 ti | i7 |
1165g7 | Rtx 3050 | i5 |
What I would like to do is calculate how similar they are. By similar I mean: compare each value in a column to the value in the same column of the other DataFrame, for example comparing 5950x to 1165g7 (processorName). Each of these features has a value (weight); for example, processorName has a weight of 2.
So for each row of df_original I want to compare it, column by column, against the rows of df_compare. If the values in a column match, do nothing; if not, add that column's weight to a variable called weight. For example, if two rows are identical except for processorName, the weight is 2, because processorName has a weight of 2.
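To make the rule above concrete, here is a minimal sketch on the sample data using NumPy broadcasting. Only the weight of 2 for processorName comes from the description; the weights 4 for GraphicsCardname and 6 for ProcessorBrand are assumptions made up for illustration, and this is a sketch of the comparison rule rather than the code actually in use.

```python
import numpy as np
import pandas as pd

columns = ["processorName", "GraphicsCardname", "ProcessorBrand"]

df_original = pd.DataFrame(
    [["5950x", "Rtx 3060 ti", "i7"],
     ["3600", "Rtx 3090", "i7"],
     ["1165g7", "Rtx 3050", "i5"]],
    columns=columns,
)
df_compare = pd.DataFrame(
    [["5950x", "Rtx 3090", "i7"],
     ["1165g7", "Rtx 3060 ti", "i7"],
     ["1165g7", "Rtx 3050", "i5"]],
    columns=columns,
)

# processorName = 2 is taken from the description above;
# 4 and 6 are ASSUMED weights for the other two columns.
weights = np.array([2, 4, 6])

orig = df_original[columns].to_numpy()
comp = df_compare[columns].to_numpy()

# mismatch[i, j, k] is True when row i of df_original and
# row j of df_compare differ in column k.
mismatch = orig[:, None, :] != comp[None, :, :]

# Sum the weights of the mismatching columns for every (i, j) pair.
values = (mismatch * weights).sum(axis=2)
print(values)
```

With these assumed weights, `values[0][1]` is 2 because the first row of df_original and the second row of df_compare differ only in processorName.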
This is what I am doing:

    values = []
    for i, df_orig in enumerate(df_original):
        values.append([])
        for df_comp in df_compare:
            values[i].append(calculate_values(df_orig, df_comp, columns))

    def calculate_values(df_orig, df_comp, columns):
        weight = 0
        for i, c in enumerate(df_orig):
            if df_comp[i] != c:
                # just gets the so-called weight, e.g. 2 if they don't have the same processorName
                weight += get_weight(columns[i])
        return weight

The output would look like `values = [[2, 2, 6], [2, 4, 6], ...]`.
Here `values[0]` holds the weights for the first row of df_original: `values[0][0]` is the weight from comparing the first row of df_original with the first row of df_compare, `values[0][1]` is the weight from comparing the first row of df_original with the second row of df_compare, and so on.
This triple-nested for loop is very slow and raises a MemoryError. I am working with around 200k rows in each DataFrame.
Would you mind helping me rewrite this so that it runs faster?
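For scale: 200k x 200k rows is 4 x 10^10 pairs, so even at one byte per weight the full matrix would need about 40 GB, and a nested Python list needs far more. One possible direction, sketched below under stated assumptions, is to encode each column as integer codes and compute the weights one block of df_original rows at a time, reducing or saving each block before moving on. The function name `iter_weight_blocks`, the `weights` dict, and `chunk_size` are hypothetical names introduced here, and the `uint8` output assumes the weights sum to at most 255.

```python
import numpy as np
import pandas as pd


def iter_weight_blocks(df_original, df_compare, weights, chunk_size=1000):
    """Yield (start_row, block) pairs, where block[i, j] is the summed weight
    of the columns in which row start_row + i of df_original differs from
    row j of df_compare. `weights` maps column name -> weight (assumed input).
    """
    columns = list(weights)
    n_orig = len(df_original)
    # ASSUMPTION: the weights sum to at most 255, so uint8 cannot overflow.
    w = np.array([weights[c] for c in columns], dtype=np.uint8)

    orig_codes = np.empty((n_orig, len(columns)), dtype=np.int32)
    comp_codes = np.empty((len(df_compare), len(columns)), dtype=np.int32)
    for k, c in enumerate(columns):
        # Factorize both frames together so equal strings share one code.
        both = pd.concat([df_original[c], df_compare[c]], ignore_index=True)
        codes, _ = pd.factorize(both)
        orig_codes[:, k] = codes[:n_orig]
        comp_codes[:, k] = codes[n_orig:]

    for start in range(0, n_orig, chunk_size):
        block = orig_codes[start:start + chunk_size]
        out = np.zeros((len(block), len(comp_codes)), dtype=np.uint8)
        for k in range(len(columns)):
            # Compare one column at a time to keep temporaries 2-D.
            differs = block[:, k, None] != comp_codes[None, :, k]
            out += differs.astype(np.uint8) * w[k]
        yield start, out
```

Each block holds `chunk_size * len(df_compare)` bytes (roughly 200 MB for 1000 x 200k), so the caller would typically reduce it right away, for example with `block.min(axis=1)` to keep only the closest match per row, instead of collecting every block in memory.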
Thanks
` tags in your table because we cannot copy the data using `pd.read_clipboard`. Also, make sure you provide an expected output because you never defined `calculate_values` or `get_weight` in your question. – It_is_Chris Oct 19 '22 at 15:27