How to measure similarity between strings?

Question

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check whether two strings in a vector are similar.

For example:

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")

I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

score 32 · Accepted Answer · edited May 23 '17 at 12:08

This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers to related questions.

But most often agrep will do what you want:

> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3

$`Bush, G.W.`
[1] 2

$`Obama, B.H.`
[1] 1 3

$`Clinton, W.J.`
[1] 4
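To make the Levenshtein distance mentioned above more concrete, here is a minimal sketch that computes all pairwise edit distances with base R's adist() and then looks up the nearest other name for each entry. The variable names d and nearest are only illustrative, and this is one possible way to inspect the distances, not necessarily how agrep works internally.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# Pairwise generalized Levenshtein distances (4 x 4 matrix)
d <- adist(pres)
dimnames(d) <- list(pres, pres)
print(d)

# Ignore the zero distance of each string to itself
diag(d) <- NA

# For every name, pick the other name with the smallest distance
nearest <- pres[apply(d, 1, which.min)]
print(data.frame(name = pres, nearest = nearest))
```

A small distance relative to the string length suggests two spellings of the same name; what counts as "small" depends on the data.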

score 18 · Answer 2 · answered May 18 '11 at 12:15

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.

lapply(pres, agrep, pres, value = TRUE)

[[1]]
[1] " Obama, B."  "Obama, B.H."

[[2]]
[1] "Bush, G.W."

[[3]]
[1] " Obama, B."  "Obama, B.H."

[[4]]
[1] "Clinton, W.J."
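How tolerant agrep is can be tuned with its max.distance argument. The sketch below compares the default (0.1, read as a fraction of the pattern length) with a stricter setting. It assumes, as I read the documentation, that a whole number such as 1 is taken as an absolute number of edits; the names loose and strict are only illustrative.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# Default tolerance: max.distance = 0.1, a fraction of the pattern length
loose <- agrep("Obama, B.H.", pres, value = TRUE)

# Stricter tolerance (assumption: 1 means at most one edit in total)
strict <- agrep("Obama, B.H.", pres, max.distance = 1, value = TRUE)

print(loose)
print(strict)
```

Tightening max.distance reduces false matches between genuinely different names, at the cost of missing some misspellings.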

Paul Rougieux · Answer 3 · 2018-12-05T10:35:07.750

First, add another near-duplicate to show that this works with more than one duplicate.

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")

adist() computes the generalized Levenshtein (edit) distance between the elements of two character vectors:

adist(" Obama, B.", pres)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    9    3   10    7

For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:

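d <- adist(" Obama, B.", pres)
# Index by the position of the smallest non-zero distance, not by the distance itself
pres[which(d == min(d[d > 0]))]
# [1] "Obama, B.H."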

To obtain unique names while allowing for spelling errors and inconsistencies, you can compare each string to all the ones kept so far and drop it if a similar one is already there. I created a keepunique() function that does this, treating two strings as similar when their edit distance is below 5. keepunique() is then applied to all elements of the vector successively with Reduce().

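keepunique <- function(previousones, x){
    # Drop x if it is within an edit distance of 5 of any name kept so far
    if(any(adist(x, previousones) < 5)){
        x <- NULL
    }
    # Otherwise append x to the names kept so far
    return(c(previousones, x))
}
Reduce(keepunique, pres)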
# [1] " Obama, B."    "Bush, G.W."    "Clinton, W.J."
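The cutoff of 5 in keepunique() is hard-coded, and the stray leading space in " Obama, B." counts as one edit. As a rough sketch of a variation, the hypothetical helper keep_unique_names() below exposes the cutoff as an argument (max_dist, my own name) and trims whitespace before comparing. It is an illustration under those assumptions, not a replacement for the function above.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.", "Bush, G.")

# Hypothetical helper: same idea as keepunique(), with an adjustable cutoff
keep_unique_names <- function(x, max_dist = 5) {
  Reduce(function(kept, candidate) {
    # Edit distances between the trimmed candidate and all trimmed kept names
    dists <- adist(trimws(candidate), trimws(kept))
    # Keep the candidate only if no kept name is closer than max_dist
    if (any(dists < max_dist)) kept else c(kept, candidate)
  }, x)
}

print(keep_unique_names(pres))
print(keep_unique_names(pres, max_dist = 1))
```

With max_dist = 1 only exact duplicates (after trimming) are dropped, so a suitable cutoff has to be chosen by looking at the data.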
