
In my quest to rewrite old (slow) code with the data.table package, I am trying to figure out the best way to use apply with a data.table.

I have a data.table with multiple id columns, followed by multiple columns of dose-response data in wide format. I need the answer to be general, because not all of my data.tables will have the same number of dose-response columns. For simplicity, I think the following data.table illustrates the issue:

library(data.table)
library(microbenchmark)
set.seed(1234)
DT1 =  data.table(unique_id = paste0('id',1:1e6),
                 dose1 = sample(c(1:9,NA),1e6,replace=TRUE),
                 dose2 = sample(c(1:9,NA),1e6,replace=TRUE)
                 )

> DT1
          unique_id dose1 dose2
       1:       id1     2     2
       2:       id2     7     4
       3:       id3     7     9
       4:       id4     7     4
       5:       id5     9     3
---                      
  999996:  id999996     4     3
  999997:  id999997    NA     3
  999998:  id999998     4     2
  999999:  id999999     8     5
 1000000: id1000000     6     7

So each row has a unique id (plus some other id columns in the real data), and I have left out the response columns because they will be NA wherever the dose columns are NA. What I need to do is remove the rows where all of the dose columns are NA. I came up with the first option, then realized I could trim it down to the second option.

DT2 <- copy(DT1)
DT3 <- copy(DT1)

len.not.na <- function(x){length(which(!is.na(x)))}

option1 <- function(DT){
  DT[,flag := apply(.SD,1,len.not.na),.SDcols=grep("dose",colnames(DT))]
  DT <- DT[flag != 0]
  DT[ , flag := NULL ]
}

option2 <- function(DT){
  DT[ apply(DT[,grep("dose",colnames(DT)),with=FALSE],1,len.not.na) != 0 ]
}

> microbenchmark(op1 <- option1(DT2), op2 <- option2(DT3),times=25L)
Unit: seconds
                expr      min       lq   median       uq      max neval
 op1 <- option1(DT2) 8.364504 8.863436 9.145341 11.27827 11.50356    25
 op2 <- option2(DT3) 8.291549 8.774746 8.982536 11.15269 11.72199    25

Clearly the two options do about the same thing, with option 1 having a few more steps, but I wanted to test how calling .SD might slow things down, as has been suggested in other posts (for example).

Either way, both options are still on the slow side. Any suggestions for speeding things up?

EDIT with comment from @AnandaMahto

DT4 <- copy(DT1)
option3 <- function(DT){
  DT[rowSums(DT[,grep("dose",colnames(DT)),with=FALSE]) != 0]
}

> microbenchmark(op2 <- option2(DT3), op3 <- option3(DT4),times=5L)
Unit: milliseconds
               expr        min         lq    median        uq       max neval
op2 <- option2(DT3) 7738.21094 7810.87777 7838.6067 7969.5543 8407.4069     5
op3 <- option3(DT4)   83.78921   92.65472  320.6273  559.8153  783.0742     5

rowSums is definitely faster. I am happy with the solution unless anyone has something faster.
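
Note that rowSums propagates NA and data.table treats NA in a logical i as FALSE, so the != 0 test above also drops rows where only some of the dose columns are NA. A minimal sketch that instead counts the NAs directly, keeping any row with at least one observed dose (dose.cols is just an illustrative name):

dose.cols <- grep("dose", colnames(DT1))
# keep rows where fewer than all of the dose columns are NA
DT1[rowSums(is.na(DT1[, dose.cols, with = FALSE])) < length(dose.cols)]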

dayne

3 Answers


My approach would be as follows:

Use rowSums to find the rows you want to keep:

Dose <- grep("dose", colnames(DT1))
# the .. prefix means "look up one level": take Dose from the calling scope, not as a column name
Flag <- rowSums(is.na(DT1[, ..Dose])) != length(Dose)
DT1[Flag]
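
The same idea can also stay inside a single data.table call via .SD; a sketch, assuming a data.table version recent enough for patterns() in .SDcols (keep is just an illustrative name):

# flag rows that have at least one non-NA dose, selecting the dose columns by name pattern
keep <- DT1[, rowSums(is.na(.SD)) < length(.SD), .SDcols = patterns("dose")]
DT1[keep]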
A5C1D2H2I1M1N2O1R2T1
DT1[!is.na(dose1) | !is.na(dose2)]

The Reduce generalization in previous edits was wrong; here's the correct version:

DT1[(!Reduce("*", DT1[, lapply(.SD, is.na), .SDcols = patterns("dose")]))]
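
Unpacking that one-liner, as a sketch with the question's two dose columns (nas and all.na are just illustrative names):

# lapply(.SD, is.na) turns each dose column into a logical column (TRUE where the dose is NA)
nas <- DT1[, lapply(.SD, is.na), .SDcols = patterns("dose")]

# Reduce("*", nas) multiplies the logical columns element-wise (here nas$dose1 * nas$dose2),
# so the product is 1 only in rows where every dose column is NA
all.na <- Reduce("*", nas)

# negating that keeps the rows with at least one observed dose
DT1[(!all.na)]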

Benchmarks

rowsum = function(dt) {
  Dose <- grep("dose", colnames(dt))
  Flag <- rowSums(is.na(dt[, ..Dose])) != length(Dose)
  dt[Flag]
}

reduce = function(dt) {
  dt[(!Reduce("*", dt[, lapply(.SD, is.na), .SDcols = patterns("dose")]))]
}

# original data
microbenchmark(rowsum(copy(DT1)), reduce(copy(DT1)), times = 10)
#Unit: milliseconds
#              expr      min       lq   median       uq      max neval
# rowsum(copy(DT1)) 184.4121 190.9895 238.2935 248.0654 266.5708    10
# reduce(copy(DT1)) 141.2399 172.2020 199.1012 219.4567 424.1526    10

# a lot more columns
for (i in 10:100) DT1[, paste0('dose', i) := sample(c(NA, 1:10), 1e6, T)]

microbenchmark(rowsum(copy(DT1)), reduce(copy(DT1)), times = 10)
#Unit: seconds
#              expr      min       lq   median       uq      max neval
# rowsum(copy(DT1)) 4.160035 4.428527 4.505705 4.754398 4.906849    10
# reduce(copy(DT1)) 3.421675 4.172700 4.507304 4.622355 5.156840    10

So at 100 columns Reduce still does all right.

eddi
  • That's essentially looping in R (with `Reduce`), which is not really necessary, while we could use vectorised functions. – Arun Nov 19 '13 at 17:14
  • yep, it's looping over the columns (and is vectorized for each column) – eddi Nov 19 '13 at 17:16
  • Yes, but we could do better than that - vectorised over all columns :). `Reduce` doesn't scale well. – Arun Nov 19 '13 at 17:17
  • I do not really understand how this works, but it seems to be faster than the `rowSums` option - see my edit. @Arun Can you give an example of how to remove the `Reduce`? – dayne Nov 19 '13 at 17:20
  • As long as you've only a few columns, it doesn't matter if you use `Reduce` or not. – Arun Nov 19 '13 at 17:28
  • @Arun I realized that after I made the edit. I reran the comparison with 40 dose columns and this method was still faster than `rowSums` - albeit much harder to understand. – dayne Nov 19 '13 at 17:33
  • @Arun all right, I don't think you're wrong - it does start losing once you have more than 100 or so columns – eddi Nov 19 '13 at 17:39
  • @Arun - actually turns out that was an anomalous measurement - I don't have enough patience to see at what point reduce would lose (if ever) - it's still doing just fine at 100 columns – eddi Nov 19 '13 at 17:49
  • @eddi I am still having a really hard time understanding how `Reduce` works here. Can you clarify a little for my benefit? – dayne Nov 19 '13 at 19:37
  • @dayne `Reduce` is a way of stringing together repetitive expressions - if you want to e.g. do `x1+x2+x3+x4+...` you can do this by adding two elements at a time like so: `((((x1 + x2) + x3) + x4) + ...)` and `Reduce` formalizes this as: `Reduce(function(x, y) x + y, list(x1,x2,x3,...))`. Since `data.table` is a list, that's what I did above for a slightly more complicated function than just `+`. – eddi Nov 19 '13 at 19:53
  • @eddi I expected `Reduce(function(x, y) is.na(x) * is.na(y), rep(NA,5))` to return 1... but it does not. However, `Reduce("*",is.na(rep(NA,5)))` behaves like I expect. I'm just super confused. – dayne Nov 19 '13 at 19:59
  • @dayne that's because I made a mistake in the reduce expression :) - let me see if I can correct it – eddi Nov 19 '13 at 20:03
  • @eddi Interesting. It seems to work as is, despite the mistake. – dayne Nov 19 '13 at 20:05
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/41486/discussion-between-dayne-and-eddi) – dayne Nov 19 '13 at 20:07
  • @eddi, a few is at least 1000, I'd say. So, for columns it seems to me to be perfectly acceptable (at these data dimensions). If you want to test the performance create a list with say 1e3,1e4,1e5 rows (with random number of elements in each) and do an operation (and try the non-reduce way of accomplishing it). I'm trying to find a post where I benchmarked `reduce`.. I'll link it if I manage to find it. In my benchmarking, even at 100, the median times gets bigger than `rowSums` – Arun Nov 20 '13 at 04:28

Maybe it's easier to just select all the rows with no NAs into a new table, like this. You can amend the 'which' condition depending on your table:

DT2 <- DT1[which(!is.na(DT1$dose1) & !is.na(DT1$dose2)), ]
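
As a side note, data.table also accepts the logical vector directly in i, so a sketch of the same filter without the which() wrapper or the $ lookups would be:

DT2 <- DT1[!is.na(dose1) & !is.na(dose2)]
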
Troy