filter by using %like% between two columns of the data table

Question

Hello stackoverflowers,

I wonder whether the %like% operator can be applied row-wise between two columns of the same data.table.

The following reproducible example should make this clearer.

First, prepare the data:

library(data.table)

iris <- as.data.table(iris)
iris <- iris[seq.int(from = 1, to = 150,length.out = 5)]
iris[, Species2 := c('set', "set|vers", "setosa", "nothing" , "virginica")]

The dataset then looks as follows:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  Species2
1:          5.1         3.5          1.4         0.2     setosa       set
2:          4.9         3.6          1.4         0.1     setosa  set|vers
3:          6.4         2.9          4.3         1.3 versicolor    setosa
4:          6.4         2.7          5.3         1.9  virginica   nothing
5:          5.9         3.0          5.1         1.8  virginica virginica

I would like to apply something like the following command row-wise:

iris[Species %like% Species2]

but it does not compare the two columns row by row. Is that possible? The result should be rows 1, 2 and 5.

LyzandeR · Accepted Answer · 2017-12-11T14:42:18.693

One way would be to group by row:

iris[, .SD[Species %like% Species2], by = 1:5]
#   : Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
#1: 1          5.1         3.5          1.4         0.2    setosa       set
#2: 2          4.9         3.6          1.4         0.1    setosa  set|vers
#3: 5          5.9         3.0          5.1         1.8 virginica virginica

Or, as per @docendodiscimus's comment, in case there are duplicate entries, you can do:

iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)]

edited Dec 11 '17 at 14:42

answered Dec 11 '17 at 14:31

In case there are duplicate entries, I'd go for `iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)]` instead of by-row grouping – talat Dec 11 '17 at 14:39
Very nice solution – pogibas Dec 11 '17 at 14:40

Mike H. · Answer 2 · 2017-12-11T21:31:46.213

You can't pass a vector to the pattern argument of %like%, since it calls grepl/grep and these are not vectorized over the pattern: only its first element is used. You could use mapply to call %like% once per row to get what you want:

iris[mapply(function(x,y) x %like% y, Species, Species2) ]

#   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
#1:          5.1         3.5          1.4         0.2    setosa       set
#2:          4.9         3.6          1.4         0.1    setosa  set|vers
#3:          5.9         3.0          5.1         1.8 virginica virginica

Microbenchmark mainly for my own curiosity, but for anyone else interested:

library(stringi)  # provides stri_detect_regex, used in the str_detect case below

set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 6), collapse = "")),
                 Species2 = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")))

microbenchmark::microbenchmark( mapply = dt[mapply(function(x,y) x %like% y, Species, Species2) ],
                                by_group1 = dt[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)],
                                by_group2 = dt[, .SD[Species %like% Species2], by = 1:nrow(dt)],
                                str_detect = dt[stri_detect_regex(Species, Species2)],
                                by_species2 = dt[,.SD[Species %like% Species2], by = Species2],
                                by_species2I = dt[dt[, .I[Species %like% Species2], by = Species2]$V1],
                                times = 5)
Unit: milliseconds
         expr        min         lq       mean     median         uq        max neval
       mapply   669.9691   680.2241   700.3758   685.8262   715.8373   750.0224     5
    by_group1 10906.2179 10908.0985 10951.5651 10914.7002 11009.0683 11019.7408     5
    by_group2 16738.4390 16826.4793 16907.8428 16902.9490 16970.6143 17100.7324     5
   str_detect   430.7768   431.1002   432.2279   431.9284   433.3488   433.9855     5
  by_species2  2482.7583  2518.6858  2547.5882  2531.4913  2599.0159  2605.9899     5
 by_species2I   110.1486   114.6775   115.9223   117.5270   118.5033   118.7553     5

I only ran it 5 times since the by_group* operations were so slow. It looks like @eddi's method using .I is the fastest (assuming I have his intended method correct).

I also re-ran the benchmark with fewer groups. In this case by_species2I is still the fastest, and the by_group* variants are still the slowest by a wide margin. That makes sense: the number of groups for by_group2 always equals the number of rows, and for by_group1 it will be close to the number of rows.

set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")),
                 Species2 = replicate(100000, paste0(sample(LETTERS, 2), collapse = "")))

Unit: milliseconds
         expr         min          lq       mean      median          uq         max neval
       mapply   611.83085   617.60180   639.7778   638.49061   652.80619   678.15932     5
    by_group1 10021.48177 10121.00419 10145.6305 10123.01354 10213.37976 10249.27339     5
    by_group2 15828.21224 15997.56034 16018.9583 16066.07284 16101.40961 16101.53651     5
   str_detect   416.44549   419.83585   420.6042   421.69423   421.85359   423.19194     5
  by_species2   106.06793   114.02764   115.5364   117.62331   118.04524   121.91770     5
 by_species2I    14.22369    14.72001    15.2137    15.24514    15.38371    16.49597     5
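
To put a rough number on "fewer groups": sample(LETTERS, k) draws without replacement, so the count of distinct k-letter strings bounds the number of Species2 groups in each run. The sketch below is only an illustration of that bound; the helper name n_patterns and the smaller sample size of 10000 are assumptions made for the example, not part of the benchmark.

```r
# Number of distinct k-letter strings when letters are drawn without replacement
# (n_patterns is a hypothetical helper introduced for this illustration)
n_patterns <- function(k) prod(26:(26 - k + 1))

groups_run1 <- n_patterns(3)  # upper bound on Species2 groups in the first run
groups_run2 <- n_patterns(2)  # upper bound on Species2 groups in the second run

# Empirical check on a smaller sample than the benchmark uses
set.seed(1)
species2 <- replicate(1e4, paste0(sample(LETTERS, 2), collapse = ""))
observed_groups <- length(unique(species2))
observed_groups <= groups_run2
```

With 100000 rows, the first run can have at most 26 * 25 * 24 = 15600 groups by Species2 and the second at most 26 * 25 = 650, which is consistent with the grouped variants speeding up in the second table.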

@eddi, fair - updated. The `data.table` group-by's are still slowest by quite a large margin. — Mike H., Dec 11 '17 at 19:21
Re by_species2, did you read the link in eddi's answer? The `.I` variant of it should be faster. — Frank, Dec 11 '17 at 19:30
@Frank I did read that but didn't get around to including it. I'll update shortly — Mike H., Dec 11 '17 at 20:31
Thanks for the benchmarks. That's amazing how fast `data.table` group-by works. I didn't expect it to beat the vectorized `stringi` function. It makes sense group-by using `.SD` is slower, since the package's FAQ vignette warns certain optimizations can't be done if `.SD` is specified. — Nathan Werth, Dec 12 '17 at 14:24

Nathan Werth · Answer 3 · 2017-12-11T14:40:47.417

%like% is just a wrapper around grepl, so the pattern (right-hand side) can only be length 1. You should be seeing a warning about this.
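
A minimal base R sketch of that behaviour, using plain character vectors as a stand-in for the two columns from the question (the variable names are illustrative only). It assumes the usual behaviour of grepl, which warns and uses only the first pattern when given several:

```r
x        <- c("setosa", "setosa", "versicolor", "virginica", "virginica")
patterns <- c("set", "set|vers", "setosa", "nothing", "virginica")

# Given a vector of patterns, grepl() warns and uses patterns[1] only
recycled   <- suppressWarnings(grepl(patterns, x))
first_only <- grepl(patterns[1], x)
identical(recycled, first_only)

# A row-wise match pairs patterns[i] with x[i] instead
row_wise <- unname(mapply(grepl, patterns, x))
which(row_wise)
```

The first comparison shows why the original filter effectively tests every row against "set"; the second is the pairing the question asks for.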

The stringi package lets you vectorize the pattern argument.

library(stringi)

iris[stri_detect_regex(Species, Species2)]

If you like the operator style instead of the function, you can make your own:

`%vlike%` <- function(x, y) {
  stri_detect_regex(x, y)
}

iris[Species %vlike% Species2]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
# 1:          5.1         3.5          1.4         0.2    setosa       set
# 2:          4.9         3.6          1.4         0.1    setosa  set|vers
# 3:          5.9         3.0          5.1         1.8 virginica virginica

score 2 · Answer 4 · answered Dec 11 '17 at 16:57

The various answers have the right bits and pieces - do it by Species2:

iris[, .SD[Species %like% Species2], by = Species2]

And if that's slow - use .I instead.
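
A sketch of what the .I version might look like, following the by_species2I expression used in Mike H.'s benchmark. The small table below is a stand-in for the question's data with character columns, and the names dt, idx and res are assumptions made for the example:

```r
library(data.table)

dt <- data.table(
  Species  = c("setosa", "setosa", "versicolor", "virginica", "virginica"),
  Species2 = c("set", "set|vers", "setosa", "nothing", "virginica")
)

# Within each by-group Species2 is a single value, so %like% gets a length-1 pattern.
# .I holds the row numbers (in dt) of the current group; keep those that match.
idx <- dt[, .I[Species %like% Species2], by = Species2]$V1

# Subset once by row number instead of materialising .SD for every group
res <- dt[idx]
```

The row numbers come back in group order, so if the original row order matters and patterns repeat, wrapping the index as sort(idx) before subsetting should restore it.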

eddi
