
How can I measure how similar two names are in R, that is, how confidently a fuzzy match can be made between them?

For example, I am working with a data frame that looks like this:

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)
df1
            Name.1             Name.2
1         gonzalez gonzalezsoldevilla
2 wassermanschultz            schultz
3   athanasopoulos    anthanasopoulos
4           armato             strain

It is clear from the data that rows 1 and 2 are similar enough to be confident that the name is the same. Row 3 is the same name even though it is misspelled, and row 4 is completely different.

As an output, I would like to create a third column that describes the degree of similarity between the names or returns a boolean of some kind to indicate a fuzzy match can be made.

Sharif Amlani

1 Answer


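The stringdist package has a function, stringsim, that returns a similarity score between 0 and 1 for each pair of strings. By default it is based on the optimal string alignment distance, scaled by the length of the longer string.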

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
library(stringdist)

df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667
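
To make the score less of a black box, here is a minimal sketch that rebuilds the same kind of similarity with base R's adist() (plain Levenshtein distance, which happens to give the same values as the default for these four pairs) and then derives the boolean column the question asks for. The 0.8 cutoff and the "one name contains the other" rule are assumptions added for illustration, not part of stringdist; the containment rule is only there because edit distance alone scores rows 1 and 2 low.

```r
# Same example data as above.
Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
df1 <- data.frame(Name.1, Name.2)

# Pairwise edit distance with base R's adist() (no extra package needed).
edit_dist <- unname(mapply(function(a, b) adist(a, b)[1, 1], Name.1, Name.2))

# Scale by the length of the longer string so the score lies between 0 and 1.
df1$similar <- 1 - edit_dist / pmax(nchar(Name.1), nchar(Name.2))

# Hypothetical cutoff; tune it on your own data.
threshold <- 0.8

# Assumed heuristic: also accept pairs where one name contains the other.
contains <- mapply(grepl, Name.1, Name.2, MoreArgs = list(fixed = TRUE)) |
  mapply(grepl, Name.2, Name.1, MoreArgs = list(fixed = TRUE))

# A pair counts as a match if it is similar enough or one name is a substring.
df1$match <- df1$similar >= threshold | unname(contains)
df1
```

With this data, rows 1 to 3 come out as matches and row 4 does not. If you would rather stay within stringdist, stringsim also takes a method argument (for example method = "jw" for Jaro-Winkler), which may treat shared prefixes more kindly than the default.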
MarBlo