I have a rather large data set (500k+ lines), where I am trying to identify potential duplicates of each row.
Currently, the code I have compared each row to all of the rows in a loop (so row 1 vs 1:500,000, then row 2 vs rows 1:500,000) and so on. The output of the loop would be the creation of two new columns - the row number of the most likely duplicate, and a duplicate score (which I have written functions for).
Obviously this code is very computationally intensive!
So I came up with a proposed solution:
Let's say you have dataframe df
with 5 rows: A,B,C,D, and E:
df:
A
B
C
D
E
In the first iteration of the loop, a matrix/vector will be created with the scores of row A compared to each row in the dataframe:
(A vs A)
(A vs B)
(A vs C)
(A vs D)
(A vs E)
After this vector is created, more columns would be added to this matrix so that it looks like this (because A vs B is the same as B vs A for example):
(A vs A) (A vs B) (A vs C) (A vs D) (A vs E)
(A vs B) NA NA NA NA
(A vs C) NA NA NA NA
(A vs D) NA NA NA NA
(A vs E) NA NA NA NA
The loop would then pick the highest score to return back into the original dataframe and the position of the highest score so that output would look like this (in this case A vs B):
df:
Row: PositionNumber HighestScore
A 2 (A vs B)
B NA NA
C NA NA
D NA NA
E NA NA
and the second iteration of the loop would produce the following outputs to the matrix and original dataframe:
Updated matrix: Updated data frame:
Row: PositionNumber HighestScore
(A vs A) (A vs B) (A vs C) (A vs D) (A vs E) A 2 (A vs B)
(A vs B) (B vs B) (B vs C) (B vs D) (B vs E) B 5 (B vs E)
(A vs C) (B vs C) NA NA NA C NA NA
(A vs D) (B vs D) NA NA NA D NA NA
(A vs E) (B vs E) NA NA NA E NA NA
The idea behind doing this is less computations power would be needed to calculated scores is between two rows, the score also applies to its reciprocal (A vs B
and B vs A
) for example.
Another idea I have: after returning the value into the data frame, would it make the process faster to delete the first column from the matrix since it's no longer needed? (I imagine a 500,000 by 500,000 matrix would hold a lot of memory).
I hope this makes sense - where do I begin with writing a for loop for this situation? I am quite new to R.
Thanks