This is a fairly vague question, but I was wondering whether there is a function or package in R that identifies or merges similar (not just identical) observations across two or more columns, and perhaps rates how similar each pair of observations is. I have two messy data sets that share some identifier columns, but the values have many spelling and other small differences between the two. For example, take two columns like these:

c1 <- c("ELIZA 2A", "aaab", "Unique New York", "I slith the Sheeth", "fdasa", "Yellow leather")

c2 <- c("ELIZA", "fjdkaldjlk", "Unique NY", "Slith Sheeth", "Y. Leather")

In this case, the 1st, 3rd, 4th, and 6th elements of c1 are similar to the 1st, 3rd, 4th, and 5th elements of c2. I would like a function or algorithm that identifies those pairs, ideally shows how similar they are, and then merges the two data sets on c1 or c2. The real data sets have over 15,000 observations with even messier rows; this is just an example. I hope that makes sense.

Thank you for your help!

Mr. Biggums

1 Answer

We can use the fuzzyjoin package, which joins two data frames on approximately matching strings:

library(fuzzyjoin)
# df1 and df2 are data frames that hold the columns c1 and c2, respectively
stringdist_inner_join(df1, df2, by = c("c1" = "c2"))

As @gersht noted in the comments, choose the `method` and `max_dist` arguments to suit your data. With the defaults, the example data returns an empty data frame.
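A minimal sketch of the join with those arguments set, using the example vectors from the question. The data frame names `df1` and `df2` and the output column name `dist` are assumptions made for illustration; `method = "cosine"` and `max_dist = 0.33` are the values reported in the comments for this toy data, not a general recommendation.

```r
library(fuzzyjoin)

c1 <- c("ELIZA 2A", "aaab", "Unique New York", "I slith the Sheeth", "fdasa", "Yellow leather")
c2 <- c("ELIZA", "fjdkaldjlk", "Unique NY", "Slith Sheeth", "Y. Leather")

# Wrap each vector in a data frame (df1 and df2 are assumed names)
df1 <- data.frame(c1 = c1, stringsAsFactors = FALSE)
df2 <- data.frame(c2 = c2, stringsAsFactors = FALSE)

# Keep only pairs whose cosine string distance is at most max_dist;
# distance_col stores that distance in an extra column
matches <- stringdist_inner_join(
  df1, df2,
  by = c("c1" = "c2"),
  method = "cosine",
  max_dist = 0.33,
  distance_col = "dist"
)

matches
```

The extra distance column gives a rough similarity rating for each matched pair, which helps when deciding whether `max_dist` is too strict or too loose for the real data.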

akrun
  • This doesn't work well, or at all, unless you also play around with `method` and `max_dist`. Setting `method = "cosine"` and `max_dist = .33` returned a perfect join. – gersht Oct 10 '19 at 17:52
  • Thank you for this package. Funnily enough, I'm actually doing this for clinical data. However, when I tried it with my example, it gave me an empty data frame :/ – Mr. Biggums Oct 10 '19 at 17:52
  • @Mr.Biggums read my comment to get things working. Akrun will probably edit when he/she gets the chance. – gersht Oct 10 '19 at 17:54
  • @gersht That is right. I was thinking about updating before your comment, but I thought the OP was showing only fake data. – akrun Oct 10 '19 at 17:55
  • Yeah, it is only fake data, but I'm assuming I would have to tinker with method and max_dist more for my actual data set? – Mr. Biggums Oct 10 '19 at 17:57
  • @Mr.Biggums Yes, that is right. The join is based on string distance, so tweaking those parameters changes the output you get. – akrun Oct 10 '19 at 17:58
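Tinkering with `max_dist` is easier when the pairwise distances are visible. The following is a rough sketch, assuming the stringdist package (which fuzzyjoin builds on) is installed; the names `d` and `best` are made up for illustration. It computes every c1-to-c2 distance and lists the closest c2 candidate for each c1 value.

```r
library(stringdist)

c1 <- c("ELIZA 2A", "aaab", "Unique New York", "I slith the Sheeth", "fdasa", "Yellow leather")
c2 <- c("ELIZA", "fjdkaldjlk", "Unique NY", "Slith Sheeth", "Y. Leather")

# Distance matrix: one row per c1 value, one column per c2 value
d <- stringdistmatrix(c1, c2, method = "cosine", useNames = "strings")

# For each c1 value, take the nearest c2 value and its distance
best <- data.frame(
  c1 = c1,
  best_c2 = c2[apply(d, 1, which.min)],
  dist = unname(apply(d, 1, min)),
  stringsAsFactors = FALSE
)

# Sort so the most confident matches come first
best[order(best$dist), ]
```

Looking down the sorted `dist` column for the point where genuine matches give way to spurious ones suggests a starting value for `max_dist`. A full distance matrix grows with the product of the two column lengths, so for 15,000 rows on each side it may be worth checking a sample first.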