1

I'm trying to write a function that takes a data.table, a list of columns, and a list of values, and selects the rows in which each column equals its corresponding value.

So, given the following data.table:

> library(data.table)
> set.seed(1)
> dt = data.table(sample(1:5, 10, replace = TRUE),
                  sample(1:5, 10, replace = TRUE),
                  sample(1:5, 10, replace = TRUE))
> dt
    V1 V2 V3
 1:  1  5  5
 2:  4  5  2
 3:  1  2  2
 4:  2  2  1
 5:  5  1  4
 6:  3  5  1
 7:  2  5  4
 8:  3  1  3
 9:  3  1  2
10:  1  5  2

A call to filterDT(dt, c("V1", "V3"), c(1, 2)) would select the rows where V1 = 1 and V3 = 2 (rows 3 and 10 above).

My best thought was to use .SD and .SDcols to stand in for the desired columns and then do a comparison within i (from dt[i,j,by]):

> filterDT <- function(dt, columns, values) {
      dt[.SD == values, , .SDcols = columns]
  }
> filterDT(dt, c("V1", "V3"), c(1, 2))
Empty data.table (0 rows and 3 cols): V1,V2,V3

Unfortunately, this doesn't work, even when filtering by only one column.

I've noticed all examples of .SD I've found online use it in j, which tells me I'm probably doing something very wrong.

Any suggestions?

Wasabi
  • The values in 'dt' and the ones generated from the seed do not match – akrun May 22 '19 at 19:07
  • Will a join solve your problem? I.e. supplying a data.table of conditions to the `i` argument to filter rows of `dt`: `dt[data.table(V1=1, V3=3), on=.(V1,V3)]` – Scott Ritchie May 22 '19 at 19:09
  • @ScottRitchie, this has to work with an arbitrary number of columns, so I don't know how to convert `(V1=1, V3=3)` to something like `columns = values` (where `columns = (V1, V3)` and `values = (1, 3)`). – Wasabi May 22 '19 at 19:14

2 Answers

4

Assuming that the 'values' to be filtered correspond element-wise to the selected 'columns', we can compare each column with its value using Map and combine the resulting logical vectors with Reduce and `&`:

dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]
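
To see what each piece returns, here is a quick breakdown with the question's columns and values filled in directly: Map compares each selected column with its corresponding value, and Reduce collapses the resulting logical vectors into a single one that the outer dt[...] uses in i to pick the rows.

dt[, Map(`==`, .SD, c(1, 2)), .SDcols = c("V1", "V3")]                 # one logical column per condition
dt[, Reduce(`&`, Map(`==`, .SD, c(1, 2))), .SDcols = c("V1", "V3")]    # TRUE only where both conditions hold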

As a function

filterDT <- function(dt, columns, values) {
  dt[dt[, Reduce(`&`, Map(`==`, .SD, values)), .SDcols = columns]]
}

filterDT(dt, c("V1", "V3"), c(1, 2))
#   V1 V2 V3
#1:  1  4  2

Or another option is to set a key with setkeyv and subset by the key values:

setkeyv(dt, c("V1", "V3"))
dt[.(1, 2)]
#   V1 V2 V3
#1:  1  4  2
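
The same subset can also be written as an ad hoc join with the on argument, without setting a key first (see also the comments below):

dt[.(1, 2), on = c("V1", "V3")]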
akrun
  • I've been using `setkey` just when I wanted to join tables, like here: https://stackoverflow.com/a/38798400/6461462. While it should be obvious from the definition of `key`, I never thought of this. Cheers. – M-- May 22 '19 at 20:24
  • @M-M See the vignette [Keys and fast binary search based subset](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html). Instead of setting keys, you can also use the `on` argument; see Examples in `?data.table` ("# fast ad hoc row subsets (subsets as joins)") and the [comment above](https://stackoverflow.com/questions/56263213/data-table-selecting-by-comparing-list-of-columns-to-list-of-values#comment99143267_56263213) – Henrik May 22 '19 at 21:22
  • Wrapping the setkey solution in a function as foreseen by the OP: `filterDT <- function(DT, columns, values) setkeyv(DT, columns)[values]` – s_baldur May 23 '19 at 10:25
4

I think you should be able to write a function that joins using an arbitrary number of columns:

#' Filter a data.table on an arbitrary number of columns
#'
#' @param dt data.table to filter
#' @param ... named columns to filter on and their values
filter_dt <- function(dt, ...) {
  filter_criteria <- as.data.table(list(...))
  dt[filter_criteria, on = names(filter_criteria), nomatch=0]
}

# A few examples:
filter_dt(dt, V1=1, V3=2)
filter_dt(dt, V1=2, V2=2, V3=5)
filter_dt(dt, V1=c(5,4,4), V3=c(1,2,5)) 

Basically the function constructs a new data.table from the arguments supplied to ..., each argument becoming a column in the new data.table filter_criteria. This is then supplied to the i argument of dt with the column names of filter_criteria used as the columns in the join.
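
For example, roughly what happens inside filter_dt(dt, V1 = 1, V3 = 2):

# the named arguments in ... become a one-row data.table of conditions
filter_criteria <- as.data.table(list(V1 = 1, V3 = 2))
filter_criteria
#   V1 V3
#1:  1  2

# joining on its column names keeps only the matching rows of dt
dt[filter_criteria, on = names(filter_criteria), nomatch = 0]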

Scott Ritchie