R: agrep results quantifier

Question

Is there a built-in way to quantify the results of the agrep function? E.g. in

agrep("test", c("tesr", "teqr", "toar"), max.distance = 2, value = TRUE)
[1] "tesr" "teqr"

tesr is only 1 character edit away from test, while teqr is 2 and toar is 3, so toar is not found. tesr is therefore a closer match than teqr. How can that closeness be retrieved, either as a number of edits or as a percentage? Thanks!

Edit: Apologies for not putting this in the question in the first place. I am already running a two-step procedure: agrep to get my list, and then adist to get the number of edits. adist is slower, and running time is a big factor with my dataset.
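
The two-step procedure described in the edit can be sketched roughly as below. This is a minimal illustration on the toy vector from the question, not the asker's actual code, and the variable names are made up for the example. agrep() is asked for indices rather than values, and adist() then runs only on the elements that survived the filter.

```r
pattern <- "test"
x <- c("tesr", "teqr", "toar")

# Step 1: pre-filter; without value = TRUE, agrep() returns the indices
# of the elements that match within 2 edits
hits <- agrep(pattern, x, max.distance = 2)

# Step 2: edit distance, computed only for the pre-filtered elements
d <- drop(adist(pattern, x[hits]))

two_step <- data.frame(index = hits, value = x[hits], distance = d,
                       stringsAsFactors = FALSE)
print(two_step)
```

Note that agrep() matches the pattern against substrings, so on longer strings the pre-filter can let through elements whose whole-string distance exceeds the threshold; a second filter on `d` would handle that case.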

@ MrFlick. Yes, I use this one as well, but this means that I need to run 2 functions. Also, `adist` gives N permutations for all possible cases, while I need to limit to `2` in this example — Alexey Ferapontov, Oct 26 '15 at 14:44
So what can't you get out of the `adist` output? Seems like you should be able to use that to subset your vector of interest just as `grep()` does. — MrFlick, Oct 26 '15 at 14:46
you can use `Filter(function(x) x<=2, adist("test", c("tesr", "teqr", "toar")))` — Colonel Beauvel, Oct 26 '15 at 14:47
Speed is the problem. `agrep` on my data takes 2 seconds per one pattern, and `adist` is a lot slower. I have 40k patterns — Alexey Ferapontov, Oct 26 '15 at 14:49
@ColonelBeauvel, `Filter...` won't give me the indices or values of those elements that have <= 2 permutations, while this is essential — Alexey Ferapontov, Oct 26 '15 at 14:52
if adist is slower why not do it in two steps - agrep, then adist: `words<- data.frame(x=c("tesr", "teqr", "toar")); words$y <- agrepl("test", words$x, 2); ifelse(words$y, adist("test", words$x), NA)` — jeremycg, Oct 26 '15 at 15:02
@jeremycg, what do you mean? I am already running `agrep` first then `adist`. Doesn't help with running time. Thus inquiring if `agrep` can have some useful output of stats kind — Alexey Ferapontov, Oct 26 '15 at 15:06
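
On the concern that `Filter()` loses the positions of the matches: a single `adist()` call returns one distance per element, so indices, values and distances can all be recovered from it with `which()`. The sketch below assumes the toy data from the question. The `similarity` column is one common way to turn the distance into a percentage-like score (1 minus the distance divided by the longer string's length); that normalisation is an assumption for illustration, not something `agrep` or `adist` define.

```r
pattern <- "test"
x <- c("tesr", "teqr", "toar")

# adist() returns a 1 x length(x) matrix here; drop() turns it into a vector
d <- drop(adist(pattern, x))

# Positions of the elements within 2 edits of the pattern
keep <- which(d <= 2)

result <- data.frame(
  index      = keep,
  value      = x[keep],
  distance   = d[keep],
  # Illustrative normalisation: 1 means identical, 0 means nothing in common
  similarity = 1 - d[keep] / pmax(nchar(pattern), nchar(x[keep])),
  stringsAsFactors = FALSE
)
print(result)
```

This does not address the running time of `adist()` itself; it only shows that no second function is needed to get indices and values alongside the distances.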

Steven Beaupré · Answer 1 · score 4 · 2015-10-26T15:41:53.927

Another option using adist():

s <- c("tesr", "teqr", "toar")
s[adist("test", s) < 3]

Or using stringdist:

library(stringdist)
s[stringdist("test", s, method = "lv") < 3]

Which gives:

#[1] "tesr" "teqr"

Benchmark

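library(RecordLinkage)   # provides levenshteinDist()
library(stringdist)      # provides stringdist()
library(microbenchmark)

# Repeat the three test strings 1e6 times to get a large input
x <- rep(s, 10e5)
mbm <- microbenchmark(
  levenshteinDist = x[which(levenshteinDist("test", x) < 3)],
  adist = x[adist("test", x) < 3],
  stringdist = x[stringdist("test", x, method = "lv") < 3],
  times = 10
)
mbm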

Which gives:

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
 levenshteinDist  840.7897 1255.1183 1406.8887 1398.4502 1510.5398 1960.4730    10  b 
           adist 2760.7677 2905.5958 2993.9021 2986.1997 3038.7692 3472.7767    10   c
      stringdist  145.8252  155.3228  210.4206  174.5924  294.8686  355.1552    10 a

edited Oct 26 '15 at 15:41

answered Oct 26 '15 at 15:25

Steven Beaupré

Thanks. That is another possibility. Will consider as well. Eliminates the need for first `agrep` I suppose – Alexey Ferapontov Oct 26 '15 at 15:29
@AlexeyFerapontov If speed is a concern, I would go with `stringdist` – Steven Beaupré Oct 26 '15 at 15:42
Ohh! I haven't thought of it, although I was aware that `stringdist` is a lot faster than `agrep`. Just didn't put it together. – Alexey Ferapontov Oct 26 '15 at 15:44

score 3 · Accepted Answer · edited May 23 '17 at 11:44

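The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into another. The package 'RecordLinkage' may be of interest. It provides the edit distance computation below, which should perform on par with agrep, although it will not return the same results as agrep, because agrep looks for an approximate match of the pattern within each string rather than comparing whole strings.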

library(RecordLinkage)
ld <- levenshteinDist("test", c("tesr", "teqr", "toar"))
c("tesr", "teqr", "toar")[which(ld < 3)]
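
One way to see why a whole-string distance and agrep can disagree is to use base R's `adist()`, whose `partial` argument switches between whole-string distance and the substring-style matching that agrep uses by default. The sketch below is only an illustration; the second test string is made up for the example.

```r
pattern <- "test"
x <- c("tesr", "a test string")

# Whole-string distance: every extra character in x counts as an edit
full <- drop(adist(pattern, x))

# Substring distance: the pattern only has to match somewhere inside x
part <- drop(adist(pattern, x, partial = TRUE))

print(data.frame(x = x, full = full, partial = part,
                 stringsAsFactors = FALSE))
```

With a threshold of 2, the second string would pass a substring-style filter but fail a whole-string one, so the two approaches agree only when the candidates are about as long as the pattern, as in the question's example.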

answered Oct 26 '15 at 15:07

vpipkt

Thanks. I guess I can work with that (slightly faster than `adist`) and still do a 2-step process. Pity that `agrep` doesn't have this functionality – Alexey Ferapontov Oct 26 '15 at 15:19
It is two steps but only one computationally intensive function call. – vpipkt Oct 26 '15 at 15:27
Yes. First `agrep` is not needed anymore. Thanks – Alexey Ferapontov Oct 26 '15 at 15:29
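
The "one computationally intensive call" idea extends to several patterns at once, since `adist()` accepts a vector of patterns and returns a patterns-by-candidates matrix. The sketch below uses base R only; the second pattern, `"toad"`, is hypothetical and added purely so that there is more than one row.

```r
patterns <- c("test", "toad")          # "toad" is a made-up second pattern
x <- c("tesr", "teqr", "toar")

# One call: rows correspond to patterns, columns to candidate strings
d <- adist(patterns, x)

# Row/column positions of every pair within 2 edits
hits <- which(d <= 2, arr.ind = TRUE)

res <- data.frame(
  pattern  = patterns[hits[, "row"]],
  match    = x[hits[, "col"]],
  distance = d[hits],
  stringsAsFactors = FALSE
)
print(res)
```

For something like 40k patterns against a large vector, the full matrix may not fit in memory, so in practice the patterns would probably need to be processed in chunks; whether that ends up faster than the per-pattern approach is not something this sketch establishes.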
