
I have a rather large data set (500k+ rows) in which I am trying to identify potential duplicates of each row.

Currently, the code I have compares each row to every row in the data set in a loop (so row 1 vs rows 1:500,000, then row 2 vs rows 1:500,000, and so on). The output of the loop is two new columns: the row number of the most likely duplicate, and a duplicate score (I have already written functions for the scoring).
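Roughly, the structure is like the sketch below, where score_pair() is just a stand-in name for the scoring function I have already written:

n <- nrow(df)
df$PositionNumber <- NA_integer_
df$HighestScore   <- NA_real_

for (i in seq_len(n)) {
  scores <- numeric(n)
  for (j in seq_len(n)) {
    scores[j] <- score_pair(df[i, ], df[j, ])  # compare row i to row j
  }
  scores[i] <- -Inf                            # ignore the self-comparison
  df$PositionNumber[i] <- which.max(scores)
  df$HighestScore[i]   <- max(scores)
}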

Obviously this code is very computationally intensive!

So I came up with a proposed solution:

Let's say you have a data frame df with 5 rows: A, B, C, D, and E:

df:
A
B
C
D
E

In the first iteration of the loop, a matrix/vector will be created with the scores of row A compared to each row in the dataframe:

(A vs A)
(A vs B)
(A vs C)
(A vs D)
(A vs E)

After this vector is created, it would be expanded into a matrix whose first row and first column both hold the A comparisons (because A vs B is the same as B vs A, for example):

(A vs A) (A vs B) (A vs C) (A vs D)  (A vs E)   
(A vs B)    NA       NA       NA        NA
(A vs C)    NA       NA       NA        NA    
(A vs D)    NA       NA       NA        NA   
(A vs E)    NA       NA       NA        NA 

The loop would then write the highest score (ignoring A vs A) and its position back into the original data frame, so the output would look like this (in this case A vs B):

df:

Row:  PositionNumber  HighestScore
A          2            (A vs B)
B          NA              NA 
C          NA              NA
D          NA              NA
E          NA              NA

and the second iteration of the loop would produce the following outputs to the matrix and original dataframe:

Updated matrix:                                       Updated data frame:                                                                                                                         
                                                      Row:  PositionNumber  HighestScore
(A vs A) (A vs B) (A vs C) (A vs D)  (A vs E)         A          2            (A vs B)
(A vs B) (B vs B) (B vs C) (B vs D)  (B vs E)         B          5            (B vs E)       
(A vs C) (B vs C)   NA        NA        NA            C          NA              NA         
(A vs D) (B vs D)   NA        NA        NA            D          NA              NA 
(A vs E) (B vs E)   NA        NA        NA            E          NA              NA 

The idea behind doing this is that less computing power would be needed: once a score is calculated between two rows, it also applies to the reciprocal comparison (A vs B is the same as B vs A, for example).
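Here is a minimal sketch of what I have in mind, filling only one triangle of the score matrix and mirroring it (again with score_pair() as a stand-in for my scoring function):

n <- nrow(df)
score_mat <- matrix(NA_real_, n, n)

for (i in seq_len(n)) {
  for (j in seq_len(i)) {           # only j <= i, so each pair is scored once
    s <- score_pair(df[i, ], df[j, ])
    score_mat[i, j] <- s
    score_mat[j, i] <- s            # mirror, since (A vs B) equals (B vs A)
  }
}

diag(score_mat) <- -Inf             # so a row never picks itself as its best match
df$PositionNumber <- apply(score_mat, 1, which.max)
df$HighestScore   <- apply(score_mat, 1, max)

This roughly halves the number of score_pair() calls, although the number of pairs considered is still of the order n².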

Another idea I have: after returning the value into the data frame, would it make the process faster to delete the first column from the matrix, since it's no longer needed? (I imagine a 500,000 by 500,000 matrix would take up a lot of memory.)
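(Rough arithmetic: a full 500,000 × 500,000 matrix of doubles is about 500,000² × 8 bytes ≈ 2 TB, so I doubt the complete matrix could be held in memory at once anyway.)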

I hope this makes sense - where do I begin with writing a for loop for this situation? I am quite new to R.

Thanks

  • Good idea! The matrix you describe is usually called a *distance matrix*. The function `dist` will create a matrix like that for a variety of distances on numerics. If your inputs are strings, then the `stringdist` package has a function `stringdistmatrix` for you. – Gregor Thomas Sep 26 '19 at 02:43
  • You aren't quite right about deleting columns from the matrix as you go. It's true, it would free up memory, but in a memory and processor intensive way. 99% of the computational challenge is in creating the distance matrix. Getting the max entry from each row is quick (even quicker if you go by columns, because R stores matrices as columns, rather than as rows). Deleting the column after you're done with it will free up memory, but it will take enough time to do that, unless you *need* that memory immediately, it won't be worth doing. – Gregor Thomas Sep 26 '19 at 02:46
  • Ooh thank you I'll have a look at it! @Gregor My issue is though, I'm looking for the Levenshtein distance between multiple variables (Surname, firstname, address...etc). Would I have to do multiple distance matrices then? – Dieu94 Sep 26 '19 at 06:04
  • Yeah. Get a distance matrix for each of those. Then you could combine them somehow into an overall score. Adding them up would be one way, but probably not great, as perhaps address matches are a better indicator of a match than first name matches. If you have a subset where you are confident of the right answer, you could fit a model to calculate weights for how much a match in each category matters. – Gregor Thomas Sep 26 '19 at 13:29
  • If you want to be really fancy, you could use some sort of weighting based on the uniqueness of the name/match, perhaps divide each row by its average (maybe on the log scale). Somehow it would be nice to control for common names being common, and not necessarily a good match, while matches of uncommon names are stronger. I'm just guessing here---if you look up probabilistic matching I'm sure there's lots of well-established methods. – Gregor Thomas Sep 26 '19 at 13:30
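Based on the suggestions in the comments, here is roughly what I think a stringdist-based version would look like; the column names and the weights are made up and would need tuning:

library(stringdist)

# one Levenshtein distance matrix per field
d_surname <- as.matrix(stringdistmatrix(df$Surname,   method = "lv"))
d_first   <- as.matrix(stringdistmatrix(df$Firstname, method = "lv"))
d_address <- as.matrix(stringdistmatrix(df$Address,   method = "lv"))

# combine into one overall distance (weights are guesses)
d_total <- 1.0 * d_surname + 0.5 * d_first + 2.0 * d_address

diag(d_total) <- Inf                              # ignore self-comparisons
df$PositionNumber <- apply(d_total, 1, which.min) # smallest distance = best match
df$LowestDistance <- apply(d_total, 1, min)

On the full 500k rows these matrices will not fit in memory, so this would presumably have to be run on blocks of rows, e.g. with stringdistmatrix(block, whole_column).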
