
I have a list of university names that were entered with spelling errors and inconsistencies. I need to match them against an official list of university names so that I can link my data together.

I know a fuzzy match/join is the way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.

d <- data.frame(name = c("University of New Yorkk", "The University of South Carolina",
                         "Syracuuse University", "University of South Texas",
                         "The University of No Carolina"),
                score = c(1, 3, 6, 10, 4), stringsAsFactors = FALSE)

y <- data.frame(name = c("University of South Texas", "The University of North Carolina",
                         "University of South Carolina", "Syracuse University",
                         "University of New York"),
                distance = c(100, 400, 200, 20, 70), stringsAsFactors = FALSE)

I want an output that pairs each name with its closest official match:

matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina", 
"Syracuuse University","University of South Texas","The University of No Carolina"), 
correctmatch = c("University of New York", "University of South Carolina", 
"Syracuse University","University of South Texas", "The University of North Carolina"))

1 Answer


I use adist() for things like this and have a little wrapper function called closest_match() to compare a value against a set of "good/permitted" values.

library(magrittr) # for the %>%

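# Return the entry of good_values closest to bad_value (several entries on a tie)
closest_match <- function(bad_value, good_values) {
  # edit distance from bad_value to every candidate, named by candidate
  distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
    as.numeric() %>%
    setNames(good_values)

  # keep the name(s) of the candidate(s) at the minimum distance
  distances[distances == min(distances)] %>%
    names()
}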

sapply(d$name, function(x) closest_match(x, y$name)) %>%
  setNames(d$name)

           University of New Yorkk   The University of South Carolina 
          "University of New York" "The University of North Carolina" 
              Syracuuse University          University of South Texas 
             "Syracuse University"        "University of South Texas" 
     The University of No Carolina 
"The University of North Carolina" 

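adist() computes the generalized Levenshtein (edit) distance between two strings: the minimal number of insertions, deletions, and substitutions needed to turn one string into the other. The smaller the distance, the more similar the strings.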
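Because edit distance counts every character, a leading "The" can outweigh the part of the name that actually differs: "The University of South Carolina" is 2 substitutions away from "The University of North Carolina" but 4 deletions away from "University of South Carolina". Below is a minimal sketch of one way around that, assuming it is acceptable to lowercase the names and drop a leading "The" before comparing. `normalize_name()` and `closest_match_normalized()` are hypothetical helpers, not part of the function above, and the sample vectors mirror the question's data as single-line strings.

```r
# Hypothetical helper: lowercase, collapse runs of whitespace, drop a leading "the "
normalize_name <- function(x) {
  x <- tolower(trimws(gsub("[[:space:]]+", " ", x)))
  sub("^the ", "", x)
}

# Hypothetical variant of closest_match(): compare the normalized strings,
# but return the original spelling taken from good_values
closest_match_normalized <- function(bad_value, good_values) {
  distances <- adist(normalize_name(bad_value), normalize_name(good_values))
  good_values[which.min(distances)]  # the first candidate wins on a tie
}

bad <- c("University of New Yorkk", "The University of South Carolina",
         "Syracuuse University", "University of South Texas",
         "The University of No Carolina")
good <- c("University of South Texas", "The University of North Carolina",
          "University of South Carolina", "Syracuse University",
          "University of New York")

# One official name per input; vapply() enforces a single string per element
result <- vapply(bad, closest_match_normalized, character(1), good_values = good)
print(result)
```

On this sample every input resolves to the official name the question asks for. Whether stripping "The" is safe depends on the official list: if two institutions differed only by that word, this normalization would merge them.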

Nate
  • I'm very new to this format and have a quick question. How do I use the output of your final line of code? I wrote the result of the sapply line to a new data frame and got NAs. Did I mess something up? – thewrightowns Oct 30 '18 at 20:41
  • I would save the final line as the variable `decoder` and then call `d$name <- decoder[d$name]` to overwrite the current values with the new correct matches. – Nate Oct 31 '18 at 12:46
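
The `decoder` idea from the comment above could look roughly like the sketch below. It assumes `d$name` is a character column (indexing a named vector with a factor would use the integer codes, not the names), restates closest_match() in base R so the block runs without magrittr, and adds the match as a new column rather than overwriting `name`. The `linked` data frame and the `correctmatch` column name are illustrative choices.

```r
d <- data.frame(name = c("University of New Yorkk", "The University of South Carolina",
                         "Syracuuse University", "University of South Texas",
                         "The University of No Carolina"),
                score = c(1, 3, 6, 10, 4), stringsAsFactors = FALSE)

y <- data.frame(name = c("University of South Texas", "The University of North Carolina",
                         "University of South Carolina", "Syracuse University",
                         "University of New York"),
                distance = c(100, 400, 200, 20, 70), stringsAsFactors = FALSE)

# Base-R restatement of closest_match(); which.min() keeps the first candidate on a tie
closest_match <- function(bad_value, good_values) {
  good_values[which.min(adist(bad_value, good_values, ignore.case = TRUE))]
}

# Named character vector: names are the misspelled inputs, values the official names
decoder <- vapply(d$name, closest_match, character(1), good_values = y$name)

# Look up each input by name and store the match next to the original spelling
d$correctmatch <- unname(decoder[d$name])

# Join on the official name to pull in the columns from y
linked <- merge(d, y, by.x = "correctmatch", by.y = "name", all.x = TRUE)
print(linked)
```

Keeping the original `name` column makes it easy to review the pairs by eye before trusting the join, which is worth doing since the nearest string is not always the intended institution.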