Similarity scores based on string comparison in R (edit distance)

Question

I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.

Have you explored `adist` and `agrep`? I'm not familiar with SPEDIS. — A5C1D2H2I1M1N2O1R2T1, Jul 18 '12 at 06:51

David Robinson · Accepted Answer · 2012-07-18T07:19:46.290

The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).

The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
  user.self sys.self user.child sys.child
1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                               test replications elapsed relative user.self
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
  sys.self user.child sys.child
1    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
  user.self sys.self user.child sys.child
1      6.49    0.743          0         0

This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim
function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), 
        nchar(str2))))
}

FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

mylevsim = function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), 
        nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                  test replications elapsed relative user.self
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
  sys.self user.child sys.child
1    0.627          0         0

Long story short- there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.

Hi, Yes the function is helpful. Also, is it possible to use this function directly in an sql query. I am using sqldf package to write an sql query and assign its result to a data frame in R.example title_score<-sqldf("select a.id as mp_id, b.id as sp_id, case when levenshteinSim(a.title,b.title) between 0 and 100 then ((100-levenshteinSim(a.title,b.title))/100)*c.weights else 0 end as title_score from allproducts a join allproducts b on a.subcategory_id=b.subcategory_id and a.id > b.id join filterweights c on b.subcategory_id=c.subcategory and c.filter_name='title'"); — Kunal Batra, Jul 18 '12 at 12:20
*Package ‘RecordLinkage’ was removed from the CRAN repository.* ... *Archived on 2020-02-02 as no maintainer to correct check errors.* — s_baldur, Feb 20 '20 at 15:48
if I'm currently obliged to use R 3.6.0, would you be able to suggest a similar package? Unfortunately both adist and RecordLinkage require R 3.6.3. Thank you — Angelo, Aug 12 '21 at 14:33
@Angelo Yep, stringdist! https://cran.r-project.org/web/packages/stringdist/index.html — David Robinson, Aug 13 '21 at 01:28

Similarity scores based on string comparison in R (edit distance)

1 Answers1

Linked

Related