Fuzzy Match Across Columns in R

Question

How can I measure how similar two names are in R? In other words, I want to quantify how confidently a fuzzy match can be made.

For example, I am working with a data frame that looks like this:

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)

df1
            Name.1             Name.2
1         gonzalez gonzalezsoldevilla
2 wassermanschultz            schultz
3   athanasopoulos    anthanasopoulos
4           armato             strain

It is clear from the data that rows 1 and 2 are similar enough to be confident that the name is the same. Row 3 is the same name even though it is misspelled, and row 4 is completely different.

As an output, I would like to create a third column that describes the degree of similarity between the names or returns a boolean of some kind to indicate a fuzzy match can be made.

score 8 · Accepted Answer · answered Jul 12 '20 at 08:22

The stringdist package has a function, stringsim, which returns a number between 0 and 1 describing how similar two strings are.

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
library(stringdist)

df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667
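
The question also asks for a boolean flag. Note that rows 1 and 2 score below 0.5 above: as far as I can tell, stringsim's default method is an edit distance scaled by the length of the longer string, so a short name contained in a long one is penalized heavily. A minimal sketch of one way to get a flag, assuming a cutoff of 0.8 (the `threshold` value and the `match` column name are illustrative choices, not part of stringdist) combined with a plain substring check:

```r
library(stringdist)

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
df1 <- data.frame(Name.1, Name.2)

# Similarity score between 0 and 1, as in the answer above
df1$similar <- stringsim(Name.1, Name.2)

# Assumed cutoff, for illustration only; tune it on your own data
threshold <- 0.8

# TRUE when one name appears verbatim inside the other (rows 1 and 2 here)
contained <- unname(mapply(
  function(a, b) grepl(a, b, fixed = TRUE) || grepl(b, a, fixed = TRUE),
  Name.1, Name.2
))

# Flag a match when the score clears the cutoff or one name contains the other
df1$match <- df1$similar >= threshold | contained
df1$match
#> [1]  TRUE  TRUE  TRUE FALSE
```

The substring check is a blunt tool: very short names will be found inside many longer ones, so it may need a minimum length in real data.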

MarBlo

This is fabulous! Thank you so much for this package! I appreciate the help. – Sharif Amlani Jul 12 '20 at 08:33

@Sharif Amlani you are welcome. You should thank the author of the package. – MarBlo Jul 12 '20 at 08:35

Excellent, I'll shoot him/her an email! – Sharif Amlani Jul 12 '20 at 08:36