
Is there a built-in way to quantify the results of the agrep function? E.g. in

agrep("test", c("tesr", "teqr", "toar"), max.distance = 2, value = TRUE)
[1] "tesr" "teqr"

tesr is only 1 character edit (a substitution) away from test, while teqr is 2 away, and toar is 3 away and hence not matched. So tesr is a closer match than teqr. How can this be retrieved, either as a number of edits or as a percentage? Thanks!

Edit: Apologies for not putting this in the question in the first place. I am already running a two-step procedure: agrep to get my list of matches, and then adist to get the number of edits. adist is slower, and running time is a big factor with my dataset.

Alexey Ferapontov
  • Look at the `adist()` function – MrFlick Oct 26 '15 at 14:42
  • @MrFlick Yes, I use this one as well, but this means that I need to run 2 functions. Also, `adist` gives N permutations for all possible cases, while I need to limit to `2` in this example – Alexey Ferapontov Oct 26 '15 at 14:44
  • What is the problem with using two functions? – Colonel Beauvel Oct 26 '15 at 14:46
  • So what can't you get out of the `adist` output? Seems like you should be able to use that to subset your vector of interest just as `grep()` does. – MrFlick Oct 26 '15 at 14:46
  • You can use `Filter(function(x) x<=2, adist("test", c("tesr", "teqr", "toar")))` – Colonel Beauvel Oct 26 '15 at 14:47
  • Speed is the problem. `agrep` on my data takes 2 seconds per pattern, and `adist` is a lot slower. I have 40k patterns – Alexey Ferapontov Oct 26 '15 at 14:49
  • @ColonelBeauvel, `Filter...` won't give me the indices or values of those elements that have <= 2 permutations, and this is essential – Alexey Ferapontov Oct 26 '15 at 14:52
  • If adist is slower, why not do it in two steps, agrep then adist: `words<- data.frame(x=c("tesr", "teqr", "toar")); words$y <- agrepl("test", words$x, 2); ifelse(words$y, adist("test", words$x), NA)` – jeremycg Oct 26 '15 at 15:02
  • @jeremycg, what do you mean? I am already running `agrep` first, then `adist`. It doesn't help with running time. Hence my question whether `agrep` can return some useful statistics of this kind – Alexey Ferapontov Oct 26 '15 at 15:06

2 Answers


Another option using adist():

s <- c("tesr", "teqr", "toar")
s[adist("test", s) < 3]

Or using the stringdist package:

library(stringdist)
s[stringdist("test", s, method = "lv") < 3]

Which gives:

#[1] "tesr" "teqr"
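
Since the question also asks for the matching indices, the number of edits, and a percentage, here is a minimal base-R sketch that keeps all of them. The variable names (`pattern`, `d`, `keep`, `sim`) are illustrative, and dividing by the length of the longer string is just one common normalisation, assumed here rather than taken from `agrep`.

```r
s <- c("tesr", "teqr", "toar")
pattern <- "test"

# adist() returns a 1 x length(s) matrix; flatten it to a plain vector
d <- as.vector(adist(pattern, s))

# Indices of the candidates that are at most 2 edits away
keep <- which(d <= 2)

# Assumed normalisation: 1 means identical, 0 means every position differs
sim <- 1 - d / pmax(nchar(pattern), nchar(s))

result <- data.frame(index = keep, value = s[keep],
                     dist = d[keep], similarity = sim[keep])
print(result)
```

Sorting `result` by `dist` (or by `similarity`, descending) would then rank the candidates from closest to farthest.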

Benchmark

library(RecordLinkage)  # provides levenshteinDist()
library(stringdist)
library(microbenchmark)

x <- rep(s, 1e6)  # 3 million strings in total
mbm <- microbenchmark(
  levenshteinDist = x[which(levenshteinDist("test", x) < 3)],
  adist = x[adist("test", x) < 3],
  stringdist = x[stringdist("test", x, method = "lv") < 3],
  times = 10
)
mbm

Which gives:

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
 levenshteinDist  840.7897 1255.1183 1406.8887 1398.4502 1510.5398 1960.4730    10  b 
           adist 2760.7677 2905.5958 2993.9021 2986.1997 3038.7692 3472.7767    10   c
      stringdist  145.8252  155.3228  210.4206  174.5924  294.8686  355.1552    10 a  
Steven Beaupré

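The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another. The package 'RecordLinkage' may be of interest. It provides the edit distance computation below, which should perform on par with agrep, although it will not always return the same results: agrep looks for an approximate match to the pattern anywhere within each string, whereas levenshteinDist compares the whole strings.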

library(RecordLinkage)
ld <- levenshteinDist("test", c("tesr", "teqr", "toar"))
c("tesr", "teqr", "toar")[which(ld < 3)]
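
To illustrate why a whole-string edit distance and `agrep` can disagree, here is a small base-R sketch. It uses `adist()` instead of `levenshteinDist()` so that it runs without extra packages; the example string "xxtestxx" is made up for illustration. The `partial = TRUE` argument of `adist()` is documented as the substring-style distance that `agrep` uses.

```r
x <- c("tesr", "xxtestxx")

# Whole-string distance: every extra character counts as an edit
whole <- as.vector(adist("test", x))

# Partial distance: best approximate match against any substring of x
part <- as.vector(adist("test", x, partial = TRUE))

# agrepl() follows the partial notion, so "xxtestxx" matches because it
# contains "test" exactly, even though its whole-string distance is large
hits <- agrepl("test", x, max.distance = 1)

print(data.frame(x = x, whole = whole, partial = part, agrep_hit = hits))
```

Any string within k edits as a whole string is also within k edits as a partial match, so an `agrep` prefilter should not drop candidates that the whole-string filter would keep; the reverse does not hold.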
Community
vpipkt