Looping over different column names to condition rows in data table

Question

The following is an example that has been scaled down. I am looking for an answer that corrects the following syntax, rather than a "work-around."

set.seed(1)
dt = data.table(sum1=rnorm(10,0,1),sum2=rnorm(10,2,1))
catsummax = c(0,3)


df
      sum1       sum2
 1: -0.6264538  3.5117812
 2:  0.1836433  2.3898432
 3: -0.8356286  1.3787594
 4:  1.5952808 -0.2146999
 5:  0.3295078  3.1249309
 6: -0.8204684  1.9550664
 7:  0.4874291  1.9838097
 8:  0.7383247  2.9438362
 9:  0.5757814  2.8212212
10: -0.3053884  2.5939013



for(i in 1:2){
    dt=dt[paste0('sum',i)<=catsummax[i]]
}

This drops everything though because data table hates the syntax.

This only keeps rows that, in the first column are <= the first cat sum element, then, only keeps rows that also satisfy the second column's cat max, so it should give out:

df
          sum1       sum2
 3: -0.8356286  1.3787594
 6: -0.8204684  1.9550664
10: -0.3053884  2.5939013

There should be a more elegant way, but this avoids loop: `dt[which(colSums(t(dt) <= catsummax) == length(catsummax))]` — pogibas, Apr 11 '18 at 05:39
Chinsoon12, answer for a solution and best with an explanation of get and the syntax how its related to data table — wolfsatthedoor, Apr 11 '18 at 05:41
there should be a better way to do this...so far i can only come up with `dt[apply(sweep(dt, 2, catsummax, function(x, y) x <= y), 1, all)]`. lets wait 24hrs. — chinsoon12, Apr 11 '18 at 05:53
No chinsoon, I asked it the way I did because the data table above is just a selected few columns. I am looking precisely just to correct the syntax as I said, not find a more efficient workaround — wolfsatthedoor, Apr 11 '18 at 05:58

akrun · Answer 1 · 2018-04-11T06:21:03.853

2

We could use the Map to do the corresponding comparison and Reduce it to a single logical vector for subsetting the rows

dt[, .SD[Reduce(`&`,  Map(`<=`, .SD, catsummax))]]

Or

dt[dt[, .I[Reduce(`&`,  Map(`<=`, .SD, catsummax))]]]
#         sum1     sum2
#1: -0.8356286 1.378759
#2: -0.8204684 1.955066
#3: -0.3053884 2.593901

Update

If there are other columns in the dataset, specify the columns of interest in the .SDcols

dt[, .SD[Reduce(`&`,  Map(`<=`, .SD, catsummax)), .SDcols = sum1:sum2]]

edited Apr 11 '18 at 06:21

answered Apr 11 '18 at 06:11

akrun

874,273
37
540
662

Again the problem is that this assumes the entire data table is the one provided, these are just two columns I want to use for selecting rows of a bigger data table – wolfsatthedoor Apr 11 '18 at 06:16
1

@robertevansanders It was only bcz of your example. The code remains the same except that you specify `.SDcols`. It is not a workaround and is a standard data.table syntax – akrun Apr 11 '18 at 06:21
Imagine instead you had an arbitrary number of columns named sum1...sumJ, you need some way of creating this list of names to call, this is why I wanted it the way I posted – wolfsatthedoor Apr 11 '18 at 06:49
@robertevansanders You can use `grep("sum\\d+", names(dt), value = TRUE)` and use that as `.SDcols` – akrun Apr 11 '18 at 10:00
Why is that preferred, akrun? – wolfsatthedoor Apr 11 '18 at 14:56
@robertevansanders You meant preferred with respect to `get` ? – akrun Apr 11 '18 at 16:39
Why is your grep method with SDcols "better" syntax than get? – wolfsatthedoor Apr 11 '18 at 18:02

LachlanO · Accepted Answer · 2018-04-11T06:07:14.663

1

You just need to call the object correctly in R for data tables. Data tables are a bit like collections of objects. To call them using a string we use the get function. The get function basically just grabs the data table entry by object name, using a string. Data tables usually use unquoted strings to do this.

For your exact code, just use

for(i in 1:2){
  dt=dt[get(paste0('sum',i))<=catsummax[i],]
}

And it will work like a charm.

edited Apr 11 '18 at 06:07

answered Apr 11 '18 at 05:55

LachlanO

1,152
8
14

FWIW: There might be a performance penalty with this approach for large data.tables and many conditions as the data object is copied for each loop iteration. – Uwe Apr 11 '18 at 13:57
So ideally you want a method to do it simultaneously yeah? – wolfsatthedoor Apr 11 '18 at 14:57

score 0 · Answer 3 · answered Apr 11 '18 at 08:10

Yes, got it although I have not quite understand the whole metaprogramming things like call, name, expression, substitute, parse, deparse, eval, quote, enquote

i <- paste(paste0("sum", 1:2, " <= ", catsummax), collapse=" & ")
dt[eval(parse(text=i))]

Inspiration from: R data.table join: SQL "select *" alike syntax in joined tables?

Uwe · Answer 4 · 2018-04-11T13:59:22.420

For the sake of completeness, there is also an alternative approach which uses a non-equi join:

as.data.table(as.list(catsummax))[dt, on = .(V1 >= sum1, V2 >= sum2), nomatch = 0L]

           V1       V2
1: -0.8356286 1.378759
2: -0.8204684 1.955066
3: -0.3053884 2.593901

The on parameter also can be specified as character vector, i.e.,

as.data.table(as.list(catsummax))[dt, on = c("V1>=sum1", "V2>=sum2"), nomatch = 0L]

which allows for a more flexible programmatic solution:

If catsummax would be specified as data.table rather than a vector with the same column names as the ones in dt to compare to, e.g.,

catsummax_DT <- data.table(sum1 = 0, sum2 = 3)

   sum1 sum2
1:    0    3

a flexible solution can be created which works on an arbitrary and selectable set of columns:

library(magrittr)
cond <- names(catsummax_DT) %>% sprintf("%s>=%s", ., .)
catsummax_DT[dt, on = cond, nomatch = 0L]

         sum1     sum2
1: -0.8356286 1.378759
2: -0.8204684 1.955066
3: -0.3053884 2.593901

(magrittr is just used for convenience and conciseness)

Looping over different column names to condition rows in data table

4 Answers4

Update