4

I am having trouble matching character strings. Most of the difficulty centers on abbreviations.

I have two character vectors. I am trying to match each word in vector A (typos and abbreviations) to its closest match in vector B.

vec.a <- c("ce", "amer", "principl")

vec.b <- c("ceo", "american", "principal")

My first crack at this was the fuzzy matching function amatch from the stringdist package. However, I can only push it so far.

library(stringdist)
amatch(vec.a, vec.b, maxDist = 3)
[1] 1 1 3

amatch works wonderfully for typos: in this case, ce -> ceo and principl -> principal. The problem arises with abbreviations. amer should be matched with american, but ceo is a closer match because fewer edits are needed to get from amer to ceo than from amer to american. How can I do fuzzy matching in the presence of abbreviations?
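
To make the problem concrete, here is a minimal sketch of the edit distances involved. It uses base R's adist (generalized Levenshtein distance) as a stand-in for amatch's default method, which is an assumption: the two measures are related but not identical, although they should agree on these strings.

```r
vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

# Edit distance from every entry of vec.a (rows) to every entry of vec.b (columns)
d <- adist(vec.a, vec.b)
dimnames(d) <- list(vec.a, vec.b)
d

# Closest candidate per row: "amer" lands on "ceo" (3 edits),
# because reaching "american" takes 4 insertions
nearest <- vec.b[apply(d, 1, which.min)]
nearest
```

An abbreviation drops many characters at once, so a pure edit-count measure penalizes it more than it penalizes a short, unrelated word.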

Mark Rotteveel
YouLocalRUser

2 Answers

1

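Maybe base R's agrep is what the question is asking for. It looks for approximate matches of the pattern within each candidate string, so an abbreviation such as amer can match inside american.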

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

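sapply(vec.a, \(x) {
    # indices of the vec.b entries that approximately contain x
    out <- agrep(x, vec.b)
    # keep the first hit; 0 marks "no match"
    if (length(out) > 0L) out[1L] else 0L
})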
#>       ce     amer principl 
#>        1        2        3

Created on 2022-03-07 by the reprex package (v2.0.1)
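
Because the call above keeps only the first hit, it can help to look at every candidate agrep returns before trusting the result, especially for short patterns, which may match more than one entry. A small sketch, assuming the same two vectors (the name candidates is only illustrative):

```r
vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

# value = TRUE returns the matching strings instead of their indices;
# setNames labels each list element with the pattern it belongs to
candidates <- lapply(setNames(vec.a, vec.a), agrep, x = vec.b, value = TRUE)
candidates
```

If a pattern returns several candidates, lowering agrep's max.distance argument is one way to tighten the match.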

Rui Barradas
1

Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler distance works for the example provided in your question.

library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3
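
For comparison, here is a rough base R sketch of a different idea that is not part of this answer: treat each input as a possible abbreviation first (a prefix of a candidate) and fall back to edit distance only when no prefix matches. The function name match_abbrev and the max_dist threshold are hypothetical, and the sketch assumes abbreviations are always prefixes, which may not hold for real data.

```r
vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

# Hypothetical helper: prefix match first, edit distance as a fallback
match_abbrev <- function(x, candidates, max_dist = 3) {
  # Step 1: candidates that start with x (x read as an abbreviation)
  hits <- which(startsWith(candidates, x))
  if (length(hits) > 0L) return(hits[1L])
  # Step 2: otherwise take the closest candidate by Levenshtein distance
  d <- as.vector(adist(x, candidates))
  best <- which.min(d)
  if (d[best] <= max_dist) best else NA_integer_
}

res <- vapply(vec.a, match_abbrev, integer(1), candidates = vec.b)
res
```

On the example vectors this gives the same indices as the amatch calls above; on larger data the prefix step would need a tie-breaking rule when several candidates share a prefix.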
Till
  • This worked very well. Thank you. Still, when I tried it with my data, only 55% of the entries got matched. The rest were NA. I know that there should be far more matches. Is there any way I could push the match rate a bit higher? – YouLocalRUser Mar 07 '22 at 18:10
  • That is hard to say without knowing your data. Can you add a few of those cases where it fails to your question? – Till Mar 07 '22 at 20:33