29

I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.

smci
  • 32,567
  • 20
  • 113
  • 146
Kunal Batra
  • 1,001
  • 3
  • 15
  • 23

1 Answers1

54

The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).

The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
  user.self sys.self user.child sys.child
1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                               test replications elapsed relative user.self
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
  sys.self user.child sys.child
1    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
  user.self sys.self user.child sys.child
1      6.49    0.743          0         0

This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim
function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), 
        nchar(str2))))
}

FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

mylevsim = function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), 
        nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                  test replications elapsed relative user.self
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
  sys.self user.child sys.child
1    0.627          0         0

Long story short- there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.

David Robinson
  • 77,383
  • 16
  • 167
  • 187
  • Hi, Yes the function is helpful. Also, is it possible to use this function directly in an sql query. I am using sqldf package to write an sql query and assign its result to a data frame in R.example title_score<-sqldf("select a.id as mp_id, b.id as sp_id, case when levenshteinSim(a.title,b.title) between 0 and 100 then ((100-levenshteinSim(a.title,b.title))/100)*c.weights else 0 end as title_score from allproducts a join allproducts b on a.subcategory_id=b.subcategory_id and a.id > b.id join filterweights c on b.subcategory_id=c.subcategory and c.filter_name='title'"); – Kunal Batra Jul 18 '12 at 12:20
  • 1
    *Package ‘RecordLinkage’ was removed from the CRAN repository.* ... *Archived on 2020-02-02 as no maintainer to correct check errors.* – s_baldur Feb 20 '20 at 15:48
  • if I'm currently obliged to use R 3.6.0, would you be able to suggest a similar package? Unfortunately both adist and RecordLinkage require R 3.6.3. Thank you – Angelo Aug 12 '21 at 14:33
  • @Angelo Yep, stringdist! https://cran.r-project.org/web/packages/stringdist/index.html – David Robinson Aug 13 '21 at 01:28