Difference of two character vectors with substring

Question

I have two character vectors:

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

I want to remove every element of b that contains any of the values in a as a substring, e.g.

grepl("da","dada") # TRUE

How would you go about doing this efficiently?

score 10 · Accepted Answer · answered Oct 08 '15 at 12:56


We can paste the elements of 'a' into a single string with | as the delimiter, use that string as the pattern in grepl, and negate the result (!) to subset 'b'.

 b[!grepl(paste(a, collapse="|"), b)]
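A minimal sketch of what that one-liner does step by step, using the vectors from the question. The intermediate names `pattern` and `hit` are only for illustration; the last line shows the `invert` argument of `grep` mentioned in the comments as an alternative to negating `grepl`.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# Build one alternation pattern from all elements of a
pattern <- paste(a, collapse = "|")
pattern
# [1] "da|ba|cs|dd|ek"

# TRUE wherever an element of b contains at least one of the substrings
hit <- grepl(pattern, b)
hit
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Keep only the elements without a match
b[!hit]
# [1] "zyc" "ulk" "mae"

# Same result via grep(): invert = TRUE selects non-matches, value = TRUE returns the strings
grep(pattern, b, invert = TRUE, value = TRUE)
# [1] "zyc" "ulk" "mae"
```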


akrun


I guess instead of !grepl one could also use the invert parameter. But: What if a contains regex characters, such as "."? – oliver13 Oct 08 '15 at 13:37
To answer my own question: I used http://stackoverflow.com/a/14838321/1563867 on the a-vector, but I'm not sure that's the best way to do it. – oliver13 Oct 08 '15 at 13:42
@oliver13 I guess you solved the problem. If not, consider providing an example and the expected output. – akrun Oct 08 '15 at 18:10
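To illustrate the concern about regex characters, here is a hedged sketch with made-up vectors `a2` and `b2` (not from the question). It shows how the pasted pattern over-matches when 'a' contains "." or "+", and two possible ways to match literally: one `fixed = TRUE` call per pattern, or `\Q...\E` quoting with `perl = TRUE`. This is not necessarily what the linked answer does, and the `\Q...\E` variant assumes no element of 'a' itself contains `\E`.

```r
# Hypothetical data containing regex metacharacters
a2 <- c("a.c", "d+")
b2 <- c("abc", "a.c", "dd", "d+e", "xyz")

# Treated as a regex, "." matches any character and "d+" matches any run of d's
b2[!grepl(paste(a2, collapse = "|"), b2)]
# [1] "xyz"

# Literal matching, variant 1: one fixed = TRUE call per pattern, OR-ed together
hit_fixed <- Reduce(`|`, lapply(a2, grepl, x = b2, fixed = TRUE))
b2[!hit_fixed]
# [1] "abc" "dd"  "xyz"

# Literal matching, variant 2: quote each element with \Q...\E (needs perl = TRUE)
quoted <- paste0("\\Q", a2, "\\E", collapse = "|")
b2[!grepl(quoted, b2, perl = TRUE)]
# [1] "abc" "dd"  "xyz"
```

The first variant never builds a regex at all, so it sidesteps escaping entirely; the second keeps the single-call shape of the accepted answer.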

score 5 · Answer 2 · answered Oct 08 '15 at 13:34

And another solution using a simple for loop:

sel <- rep(FALSE, length(b))
for (i in seq_along(a)) {
  sel <- sel | grepl(a[i], b, fixed = TRUE)
}
b[!sel]

Not as elegant as some of the other solutions (especially the one by akrun), but it shows that a for loop isn't always as slow in R as people believe:

fun1 <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

fun2 <- function(a, b) {
  b[!apply(sapply(a, function(x) grepl(x,b, fixed=TRUE)),1,sum)]
}

fun3 <- function(a, b) {
  b[-which(sapply(a, grepl, b, fixed=TRUE), arr.ind = TRUE)[, "row"]]
}

fun4 <- function(a, b) {
  b[!grepl(paste(a, collapse="|"), b)]
}

library(stringr)
fun5 <- function(a, b) {
  b[!sapply(b, function(u) any(str_detect(u,a)))]
}

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
b <- rep(b, length.out = 1E3)

library(microbenchmark)
microbenchmark(fun1(a, b), fun2(a, b), fun3(a,b), fun4(a,b), fun5(a,b))


# Unit: microseconds
#       expr       min        lq       mean    median         uq        max neval  cld
# fun1(a, b)   389.630   399.128   408.6146   406.007   411.7690    540.969   100 a   
# fun2(a, b)  5274.143  5445.038  6183.3945  5544.522  5762.1750  35830.143   100   c 
# fun3(a, b)  2568.734  2629.494  2691.8360  2686.552  2729.0840   2956.618   100  b  
# fun4(a, b)   482.585   511.917   530.0885   528.993   541.6685    779.679   100 a   
# fun5(a, b) 53846.970 54293.798 56337.6531 54861.585 55184.3100 132921.883   100    d

Yeah that microseconds benchmark is meaningless, you should create a bit bigger data set IMO — David Arenburg, Oct 08 '15 at 17:03
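Following up on that comment, a rough sketch of how the comparison could be repeated on a larger input. The size `n`, the seed, the random five-letter strings, and the function names `loop_fixed` and `single_regex` are arbitrary assumptions for illustration; base `system.time` is used here instead of microbenchmark so the snippet has no package dependencies. No timings are shown because they depend on the machine.

```r
# Sketch only: synthetic data, sizes chosen arbitrarily
set.seed(42)
n <- 1e5
a <- c("da", "ba", "cs", "dd", "ek")

# n random lowercase strings of length 5
chars <- matrix(sample(letters, 5 * n, replace = TRUE), ncol = 5)
b_big <- do.call(paste0, as.data.frame(chars, stringsAsFactors = FALSE))

# The for-loop approach with fixed = TRUE (fun1 above)
loop_fixed <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

# The single pasted regex (fun4 above)
single_regex <- function(a, b) {
  b[!grepl(paste(a, collapse = "|"), b)]
}

# Both approaches should agree here, since the patterns contain only letters
stopifnot(identical(loop_fixed(a, b_big), single_regex(a, b_big)))

print(system.time(loop_fixed(a, b_big)))
print(system.time(single_regex(a, b_big)))
```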

erasmortg · Answer 3 · 2015-10-08T13:27:49.333

You could try the following:

b[!(+(apply(sapply(a, function(x) grepl(x,b)),1,sum)) > 0)]
# [1] "zyc" "ulk" "mae"

'Peeling' this call from the inside out, the intermediate results are as follows. First, obtain a matrix of matches from the grepl call (with sapply):

sapply(a, function(x) grepl(x,b))
#        da    ba    cs    dd    ek
#[1,] FALSE FALSE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE  TRUE FALSE FALSE
#[5,] FALSE FALSE FALSE  TRUE FALSE
#[6,]  TRUE FALSE FALSE FALSE FALSE

Note that the columns are the elements of a and the rows are the elements of b.

Then, apply the function sum to each row (in R, TRUE counts as 1 and FALSE as 0):

apply(sapply(a, function(x) grepl(x,b)),1,sum)
#[1] 0 0 0 1 1 1

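Note that a row sum can be greater than 1 (when an element of b matches more than one pattern), so the sums are turned into a logical vector by comparing them with 0. The unary + is a no-op on the numeric sums; the comparison does the work. The previous call is therefore wrapped in: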

+() > 0

With this logical vector we could subset ([) b to the matching elements, but since we want the opposite, we negate it with the ! operator.

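# full code:
step.one <- sapply(a, function(x) grepl(x,b))  # match matrix: rows = b, columns = a
step.two <- apply(step.one,1,sum)              # number of matches per element of b
step.three <- +(step.two > 0)                  # 1 if there is any match, 0 otherwise
step.four <- !step.three                       # TRUE for elements of b without a match
# finally:
b[step.four]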

As David shows in the comments, this is a much more elegant approach:

b[-which(sapply(a, grepl, b), arr.ind = TRUE)[, "row"]]
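One caveat with negative `which()` indexing, sketched below with a made-up vector `a_none` whose elements match nothing in b: when there are no matches, `which()` returns an empty index, and `b[-integer(0)]` selects nothing rather than everything. The `rowSums(...) == 0` form suggested by thelatemail in the comments keeps all of b in that case.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# Hypothetical patterns that do not occur anywhere in b
a_none <- c("qq", "xx")

# Negative indexing with an empty index drops everything
b[-which(sapply(a_none, grepl, b), arr.ind = TRUE)[, "row"]]
# character(0)

# Logical indexing keeps all of b when nothing matches
b[rowSums(sapply(a_none, grepl, x = b)) == 0]
# [1] "zyc"  "ulk"  "mae"  "csh"  "ddi"  "dada"

# With the original a, both forms give the same result
b[rowSums(sapply(a, grepl, x = b)) == 0]
# [1] "zyc" "ulk" "mae"
```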

If you want to use `sapply` here, `b[-which(sapply(a, grepl, b), arr.ind = TRUE)[, "row"]]` would be probably better than combining it with `apply` — David Arenburg, Oct 08 '15 at 13:05
Or `b[rowSums(sapply(a, grepl, x=b))==0]` or since you know the length of output, use the faster `vapply`: `b[rowSums(vapply(a, grepl, x=b, logical(length(b)) ))==0]` — thelatemail, Oct 08 '15 at 13:33
This is pretty much what I was looking for - but clearly way more complicated than the grepl-way akrun suggested — oliver13, Oct 08 '15 at 13:43
