
I am a real beginner in R and I have two lists of city names. One list holds user-generated names (people spell messily) and the other holds the correct spellings.

I tried the stringdist package and ended up with code that loops (with `for`) and returns the closest match. But I could only input vectors, and I really need to use data frames.

This is my code (oh boy, it feels awkward):

library(stringdist)

input <- "BAC"                            # misspelled name
correct <- c("ABC", "DEF", "GHI", "JKL")  # list with all correct names
shortest <- -1                            # -1 means no distance recorded yet

for (word in correct) {
  dist <- stringdist(input, word)

  # Exact match: stop searching
  if (dist == 0) {
    closest <- word
    shortest <- 0
    break
  }

  # Keep this word if it is at least as close as the best so far
  if (dist <= shortest || shortest < 0) {
    closest <- word
    shortest <- dist
  }
}

if (shortest == 0) {
  print("It's a match!")
} else {
  print(closest)
}

The idea is to use this code as a starting point: I want to go from this to applying stringdist to each row of my data frame. I don't even know if this is a good idea or whether it would take too much processing power, so don't be afraid to say it's stupid. Thanks!

  • You've got `word` in your `stringdist` call but `palavra` in your `if` statement below it. Did you forget to translate "palavra," or is that an object somewhere else in your code? – camille May 03 '19 at 23:02
  • Possible duplicate of [agrep: only return best match(es)](https://stackoverflow.com/questions/5721883/agrep-only-return-best-matches) – camille May 03 '19 at 23:09
  • @camille yes, I did forget to translate... – Gabriel Rangel May 07 '19 at 17:58

1 Answer


There is a dedicated function for this in the stringdist package, called `amatch`:

input <- "BAC"   #misspelled 
correct <- c("ABC", "DEF", "GHI", "JKL") 

correct[amatch(input, correct, maxDist = Inf)]
# "ABC"

This also works for multiple input words at once, so there is no need for a for loop:

input <- c("New Yorkk", "Berlyn", "Pariz") # misspelled 
correct <- c("Berlin", "Paris", "New York", "Los Angeles") # correct names

correct_words <- correct[amatch(input, correct, maxDist = Inf)]
data.frame(input, correct_words)

 #       input correct_words
 #   New Yorkk      New York
 #      Berlyn        Berlin
 #       Pariz         Paris
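
Since the question asks about data frames, here is a minimal sketch of the same `amatch` approach applied to data frame columns. The data frames `users` and `cities`, their column names, and the `maxDist = 2` cutoff are all assumptions made up for illustration; they are not from the original post. With a finite `maxDist`, names that are too far from every candidate come back as `NA` instead of being forced onto the nearest city.

```r
library(stringdist)

# Hypothetical data: `users` holds the messy names, `cities` the reference spellings.
users <- data.frame(
  id = 1:4,
  typed = c("New Yorkk", "Berlyn", "Pariz", "Xyzzy"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  name = c("Berlin", "Paris", "New York", "Los Angeles"),
  stringsAsFactors = FALSE
)

# amatch() returns, for each typed name, the row index of the closest city,
# or NA when no city is within maxDist (the cutoff of 2 is an arbitrary choice).
idx <- amatch(users$typed, cities$name, maxDist = 2)

# Look up the matched spelling; unmatched rows stay NA.
users$matched <- cities$name[idx]

# Distance to the chosen match, handy for reviewing doubtful rows (NA when unmatched).
users$distance <- stringdist(users$typed, users$matched)

print(users)
```

Because `amatch` is vectorised, the whole column is matched in one call, so no row-by-row loop is needed. Rows with `NA` in `matched` can then be checked by hand.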
Daniel