
I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).

Here is a simple working example.

df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))

In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (It is too dissimilar). Thanks for any suggestions.

jogall
Braden
  • I don't know R, so I won't hazard an answer, but it sounds like you might want to look into the [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) - does R have an implementation? – dcsohl Jun 18 '14 at 18:16
  • 1
    @dcsohl yes. `agrep` use this distance for fuzzy search. – agstudy Jun 18 '14 at 18:21
  • That sounds like the direction I need to go. I see a few options in R here: http://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r – Braden Jun 18 '14 at 18:23
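The Levenshtein idea from the comments can be sketched in base R with `adist()`, using the example `df` above; the 0.25 threshold (i.e. a 75% character match) is an assumption you would tune for your data:

```r
## Collapse each row into one string, then compute pairwise edit distances
xx <- do.call(paste0, df)
d  <- adist(xx)
## Flag pairs whose distance is within 25% of the row's string length;
## upper.tri() keeps each pair only once
close <- d <= 0.25 * nchar(xx) & upper.tri(d)
which(close, arr.ind = TRUE)  ## row/col indices of near-duplicate pairs
```

For the example data this flags rows 1 and 2 as a near-duplicate pair.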

4 Answers


You could use agrep (or agrepl) for partial (fuzzy) pattern matching.

> df[agrep("Andrew", df$name), ]
    name    id score
1 Andrew 12334    90
2 Andrem 12344    90

So this shows that rows 1 and 2 are both found when matching "Andrew". You could then remove the duplicates (keeping only the first "Andrew" match) with

> a <- agrep("Andrew", df$name)
> df[c(a[1], rownames(df)[-a]), ]
     name    id score
1  Andrew 12334    90
3    Adam 34345    83
4 Pamdrew 98974    95
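As a sketch of the same de-duplication, `agrepl()` (the logical-vector counterpart of `agrep()`) can be used with its default similarity:

```r
## agrepl() returns TRUE/FALSE per element instead of indices
hit <- agrepl("Andrew", df$name)   # TRUE for rows 1 and 2 here
## keep the first fuzzy match plus every row that did not match
df[c(which(hit)[1], which(!hit)), ]
```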
Rich Scriven

You can use `agrep`, but first you need to concatenate all columns so the fuzzy search covers every column and not just the first one.

xx <- do.call(paste0,df)
df[agrep(xx[1],xx,max=0.6*nchar(xx[1])),]
     name    id score
1  Andrew 12334    90
2  Andrem 12344    90
4 Pamdrew 98974    95

Note that with a threshold of 0.7 you match all rows.

Once rows are matched, you should extract them from the data frame and repeat the same process for the remaining rows (row 3 here, against the rest of the data)...

agstudy
    Your solution looks like what I need, but I would need a way to apply this process to thousands of rows of data quickly with the end result being a subset of rows that are similar to the row above or below it. – Braden Jun 18 '14 at 18:45
  • OK...I took your solution and used a loop to derive the final result I needed. My code is a little long, so instead of putting it in a comment I will add it as a solution. Not sure what protocol is here. Thank you for your help. – Braden Jun 18 '14 at 19:13

You could use some approximate string distance metric for the names such as:

adist(df$name)

     [,1] [,2] [,3] [,4]
[1,]    0    1    4    3
[2,]    1    0    3    4
[3,]    4    3    0    6
[4,]    3    4    6    0
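To turn that distance matrix into actual row pairs, one possible sketch (the cutoff of 1 edit is an assumption; adjust to taste):

```r
## Find pairs of names within edit distance 1 of each other
d <- adist(df$name)
pairs <- which(d <= 1 & upper.tri(d), arr.ind = TRUE)
pairs  ## each row gives the indices of a near-duplicate pair
```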

or use a dissimilarity matrix calculation:

require(cluster)
daisy(df[, c("id", "score")])

Dissimilarities :
      1     2     3
2    10            
3 22011 22001      
4 86640 86630 64629
Walter
  • My solution above (which extends the answer by @agstudy) is pretty slow. Your solution might be faster if I could think of a way to subset my data set based on the size of these Levenshtein Distances. For example, I might want the rows of all combinations with a distance less than 2. – Braden Jun 19 '14 at 12:29

Extending the solution provided by agstudy (see comments above), I produced the following approach, which builds a data frame where each group of similar rows appears together.

df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew", "Adan"),
                 id = c(12334, 12344, 34345, 98974, 34344),
                 score = c(90, 90, 83, 95, 83))
xx <- do.call(paste0, df)  ## concatenate all columns into one string per row
df3 <- df[0, ]             ## empty data frame for storing loop results
for (i in 1:nrow(df)) {    ## check each row of the data frame for matches
  df2 <- df[agrep(xx[i], xx, max = 0.3 * nchar(xx[i])), ]  ## set level of similarity required (less than 30% dissimilarity in this case)
  if (nrow(df2) >= 2) {df3 <- rbind(df3, df2)}  ## rows with no other match return only themselves; this drops them
  df3 <- df3[!duplicated(df3), ]  ## remove duplicate rows from df3
}

I am sure there are cleaner ways of producing these results, but this gets the job done.
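One possibly cleaner sketch (untested beyond this toy data): compute all pairwise distances once with `adist()` and avoid the explicit loop; the 0.3 threshold mirrors the one used in the loop above:

```r
xx <- do.call(paste0, df)           ## one string per row, as before
d  <- adist(xx)                     ## all pairwise edit distances at once
diag(d) <- Inf                      ## ignore each row's match with itself
## keep rows within 30% dissimilarity of at least one other row
near <- apply(d <= 0.3 * nchar(xx), 1, any)
df[near, ]
```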

Braden