
I'm just starting to learn data.table and am working my way through the vignettes, although I'm already using it in a project. How do I replace the following plyr syntax with data.table?

input <- data.table(ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
                    b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21), c1 = c(1, 2, 3),
                    c2 = c(-0.94, 2.2, -1.223))

# simple user-defined function that conveys my problem
func <- function(x, num) {
  x <- data.table(x)
  new_b <- x$b1[1]
  x2 <- within(x[1,], {
    b1 = new_b
    b2 = 51
  })
  imp <- rbindlist(replicate(num, x2, simplify= FALSE))
  return(rbindlist(list(x, imp)))
}

# wrapper function
wrap_func <- function(dat, num= 5, plyr= FALSE) {
  if (plyr) {
    return(plyr::ddply(dat, .var= "ID", .fun= func, num= num))
  } else {
    return(dat[, lapply(.SD, FUN= func, num), by= ID])
  }
}

The plyr version works:

wrap_func(dat=input, 5, plyr=TRUE)

What is the equivalent data.table syntax? My attempt fails:

wrap_func(dat=input, num=5, plyr=FALSE) # gives error

Thanks in advance!!

Update:

Based on @Frank's suggestion in the comments, I benchmarked this on my real data and code. Here, impute_zero_resp_all is the real equivalent of wrap_func in the example, and impute_zero_resp_all2 is its plyr counterpart.

I start with a dataset that has ~50k rows and 1800 groups; imputation is done by group resulting in a dataset with ~170k rows and the same 1800 groups:

# element [3] of system.time() is the elapsed time in seconds
vec1 <- vec2 <- vector(mode= "numeric", length= 50)
for (i in 1:50) {
  vec1[i] <- system.time(impute_zero_resp_all(dat= test_dat2))[3]  # data.table
  vec2[i] <- system.time(impute_zero_resp_all2(dat= test_dat2))[3] # plyr
}

summary(vec1); summary(vec2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.62   22.76   22.81   22.84   22.84   23.72 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.19   27.35   27.40   27.49   27.45   30.07

quantile(vec1, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
22.620 22.670 22.728 22.760 22.786 22.810 22.824 22.840 22.870 22.917 23.720 
quantile(vec2, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
27.190 27.289 27.330 27.357 27.376 27.400 27.424 27.440 27.476 27.522 30.070

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
alexwhitworth
  • I'm not sure it has what you're after, but check out this answer for `dplyr` to `data.table` comparisons http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly – Rorschach Aug 04 '15 at 21:29
  • Thanks! Yes, I currently have that on my to-read list re: `data.table`. The comprehensiveness is great, but the length makes it hard to get a quick answer. I realize that I have much to learn; but this seems like it should be a quick bit of syntax. – alexwhitworth Aug 04 '15 at 21:35
  • This doesn't use your custom function, but I think you want something like `input[,rbind(.SD,copy(.SD)[,b2:=51][rep(1,5)]),by=ID]` – Frank Aug 04 '15 at 21:48
  • @Frank I'm not sure if that solution will work in my real use-case since my actual user-defined function is more complicated than `func`. It appears you have provided a workaround that gives the same functionality in this specific case but not a general solution. I'm looking for a solution that **explicitly** uses the example `func` – alexwhitworth Aug 04 '15 at 22:15
  • @Alex Okay. I didn't think to try this before, but `input[, func(.SD,5), by=ID]` works. I think the `func` above is very suboptimal, though, and probably shouldn't exist/be used for this purpose. The same may be true of the function you have in mind for your actual use-case. – Frank Aug 04 '15 at 22:41
  • @Frank Thanks! Can you elaborate on **what makes / why** you think `func` is suboptimal? – alexwhitworth Aug 04 '15 at 22:45
  • Sure, probably better to talk further in the chat room http://chat.stackoverflow.com/rooms/25312/r-public so as not to clutter up comments here – Frank Aug 04 '15 at 22:48
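
For reference, here is a minimal sketch of the call Frank suggests in the comments, run on the example data from the question. `input` and `func` are copied from the question; the name `res` is only illustrative. The likely reason the original attempt fails is that `lapply(.SD, ...)` iterates over the columns of `.SD`, so `func` receives one vector at a time, whereas passing `.SD` directly hands `func` the whole per-group table (without the `by` column).

```r
library(data.table)

input <- data.table(
  ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
  b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21),
  c1 = c(1, 2, 3), c2 = c(-0.94, 2.2, -1.223)
)

# func as defined in the question
func <- function(x, num) {
  x <- data.table(x)
  new_b <- x$b1[1]
  x2 <- within(x[1, ], {
    b1 = new_b
    b2 = 51
  })
  imp <- rbindlist(replicate(num, x2, simplify = FALSE))
  rbindlist(list(x, imp))
}

# Pass the per-group table (.SD) to func once per ID;
# data.table adds the ID column back to the combined result.
res <- input[, func(.SD, 5), by = ID]
```

Inside `wrap_func`, the `else` branch would presumably become `dat[, func(.SD, num), by = ID]`, assuming `dat` has no column called `num` that would shadow the argument.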
