I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.
-
Have you explored `adist` and `agrep`? I'm not familiar with SPEDIS. – A5C1D2H2I1M1N2O1R2T1 Jul 18 '12 at 06:51
1 Answers
The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).
The levenshteinSim
function in the RecordLinkage package also does this directly, and might be faster than adist
.
library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8
ETA: Interestingly, while levenshteinDist
in the RecordLinkage package appears to be slightly faster than adist
, levenshteinSim
is considerably slower than either. Using the rbenchmark package:
> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
test replications elapsed relative
1 levenshteinDist("applesauce", "aaplesauce") 100000 4.012 1
user.self sys.self user.child sys.child
1 3.583 0.452 0 0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
test replications elapsed relative user.self
1 adist("applesauce", "aaplesauce") 100000 4.277 1 3.707
sys.self user.child sys.child
1 0.461 0 0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
test replications elapsed relative
1 levenshteinSim("applesauce", "aaplesauce") 100000 7.206 1
user.self sys.self user.child sys.child
1 6.49 0.743 0 0
This overhead is due simply to the code for levenshteinSim
, which is just a wrapper around levenshteinDist
:
> levenshteinSim
function (str1, str2)
{
return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),
nchar(str2))))
}
FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max
instead of pmax
and shave ~25% off the running time:
mylevsim = function (str1, str2)
{
return(1 - (levenshteinDist(str1, str2)/max(nchar(str1),
nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
test replications elapsed relative user.self
1 mylevsim("applesauce", "aaplesauce") 100000 5.608 1 4.987
sys.self user.child sys.child
1 0.627 0 0
Long story short- there is little difference between adist
and levenshteinDist
in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.

- 77,383
- 16
- 167
- 187
-
Hi, Yes the function is helpful. Also, is it possible to use this function directly in an sql query. I am using sqldf package to write an sql query and assign its result to a data frame in R.example title_score<-sqldf("select a.id as mp_id, b.id as sp_id, case when levenshteinSim(a.title,b.title) between 0 and 100 then ((100-levenshteinSim(a.title,b.title))/100)*c.weights else 0 end as title_score from allproducts a join allproducts b on a.subcategory_id=b.subcategory_id and a.id > b.id join filterweights c on b.subcategory_id=c.subcategory and c.filter_name='title'"); – Kunal Batra Jul 18 '12 at 12:20
-
1*Package ‘RecordLinkage’ was removed from the CRAN repository.* ... *Archived on 2020-02-02 as no maintainer to correct check errors.* – s_baldur Feb 20 '20 at 15:48
-
if I'm currently obliged to use R 3.6.0, would you be able to suggest a similar package? Unfortunately both adist and RecordLinkage require R 3.6.3. Thank you – Angelo Aug 12 '21 at 14:33
-
@Angelo Yep, stringdist! https://cran.r-project.org/web/packages/stringdist/index.html – David Robinson Aug 13 '21 at 01:28