82

I have a function that returns two values in a list. Both values need to be added to a data.table in two new columns. Evaluation of the function is costly, so I would like to avoid having to compute the function twice. Here's the example:

library(data.table)
example(data.table)
DT
   x y  v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1  4
5: b 3  5
6: b 6  6
7: c 1  7
8: c 3  8
9: c 6  9

Here's an example of my function. Remember I said it's costly compute, on top of that there is no way to deduce one return value from the other given values (as in the example below):

myfun <- function (y, v) 
{
ret1 = y + v
ret2 = y - v
return(list(r1 = ret1, r2 = ret2))
}

Here's my way to add two columns in one statement. That one needs to call myfun twice, however:

DT[,new1:=myfun(y,v)$r1][,new2:=myfun(y,v)$r2]

   x y  v new1 new2
1: a 1 42   43  -41
2: a 3 42   45  -39
3: a 6 42   48  -36
4: b 1  4    5   -3
5: b 3  5    8   -2
6: b 6  6   12    0
7: c 1  7    8   -6
8: c 3  8   11   -5
9: c 6  9   15   -3

Any suggestions on how to do this? I could save r2 in a separate environment each time I call myfun, I just need a way to add two columns by reference at a time.

Florian Oswald
  • 5,054
  • 5
  • 30
  • 38
  • Why not have your function take in a data frame and return a data frame directly? `myfun <- function (y, v) { ret1 = y + v ret2 = y - v return(list(r1 = ret1, r2 = ret2)) } – Etienne Low-Décarie Jul 04 '12 at 18:55
  • 3
    @Etienne Because that copies the inputs to create a new output. Florian is using `data.table` for its memory efficiency with large datasets; it doesn't copy `x`,`y` or `v` at all, even once. Think 20GB datasets in RAM. – Matt Dowle Jul 05 '12 at 08:58

5 Answers5

107

Since data.table v1.8.3, you can do this:

DT[, c("new1","new2") := myfun(y,v)]

Another option is storing the output of the function and adding the columns one-by-one:

z <- myfun(DT$y,DT$v)
head(DT[,new1:=z$r1][,new2:=z$r2])
#      x y  v new1 new2
# [1,] a 1 42   43  -41
# [2,] a 3 42   45  -39
# [3,] a 6 42   48  -36
# [4,] b 1  4    5   -3
# [5,] b 3  5    8   -2
# [6,] b 6  6   12    0
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
flodel
  • 87,577
  • 21
  • 185
  • 223
  • 2
    wow, that second one is amazing, thanks! just ran it with `debug(myfun)` to see how often it gets called: it's once. great. – Florian Oswald Jul 03 '12 at 10:33
  • 1
    +10 from me too. I've just raised [FR#2120](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2120&group_id=240&atid=978) to "Drop needing `with=FALSE` for LHS of `:=`" – Matt Dowle Jul 03 '12 at 10:44
  • 7
    Note that list recycling is also done; e.g., `c("a","b","c","d"):=list(1,2)` puts 1 into `a` and `c`, and 2 into `b` and `d`. If any of the columns don't exist they'll be added by reference. Not sure how useful `:=` recycling is in practice. It's more for `c("a","b","c"):=NULL` which deletes those 3 columns. Internally that's a recycle of NULL to a (semantic) list length 3. – Matt Dowle Jul 03 '12 at 10:49
  • 1
    @MatthewDowle oh yes, just wanted to ask that. the `c("a","b","c"):=NULL` is very useful. – Florian Oswald Jul 03 '12 at 12:38
  • 8
    another useful `:=` usage can be `\`:=\`(colname=colvalue,...)`. I often prefer this one because you might just replace `:=` with `list` to have a read-only preview of data to be written by reference when `:=` used. – jangorecki Jan 16 '15 at 10:53
  • Note that `dt[, c(...):=fun(...)]` works, but `dt[, c(...)=fun(...)]`doesn't. But you can just to `dt[, fun(...)]` and then `setnames(dt, c('V1', 'V2', ...), c('name1', 'name2', ...))` – naught101 Mar 26 '15 at 04:17
  • @jangorecki, I think your solution should be the preferred method. It is just that the question is old, and maybe this solution did not exist back then. Please add as an answer, so we can upvote it – LucasMation Jan 04 '17 at 12:50
  • The second syntax is clearly better than chaining n assignments, esp. for arbitrarily long function with n return values. Suggest you move it to the top. – smci May 04 '18 at 06:16
  • I came here looking for an analog to plyr's ddply. This answer is not that! It passes the whole columns y and v to myfun, instead of doing it row-by-row. See Vasco's answer for a solution, https://stackoverflow.com/a/44266413/553434. – rumtscho May 26 '21 at 10:15
4

The answer can not be used such as when the function is not vectorized.

For example in the following situation it will not work as intended:

myfun <- function (y, v, g) 
{
  ret1 = y + v + length(g)
  ret2 = y - v + length(g)
  return(list(r1 = ret1, r2 = ret2))
}
DT
#    v y                  g
# 1: 1 1                  1
# 2: 1 3                4,2
# 3: 1 6              9,8,6

DT[,c("new1","new2"):=myfun(y,v,g)]
DT
#    v y     g new1 new2
# 1: 1 1     1    5    3
# 2: 1 3   4,2    7    5
# 3: 1 6 9,8,6   10    8

It will always add the size of column g, not the size of each vector in g

A solution in such case is:

DT[, c("new1","new2") := data.table(t(mapply(myfun,y,v,g)))]
DT
#    v y     g new1 new2
# 1: 1 1     1    3    1
# 2: 1 3   4,2    6    4
# 3: 1 6 9,8,6   10    8
Vasco
  • 395
  • 2
  • 13
2

To build on the previous answer, one can use lapply with a function that output more than one column. It's is then possible to use the function with more columns of the data.table.

 myfun <- function(a,b){
     res1 <- a+b
     res2 <- a-b
     list(res1,res2)
 }

 DT <- data.table(z=1:10,x=seq(3,30,3),t=seq(4,40,4))
 DT

 ## DT
 ##     z  x  t
 ## 1:  1  3  4
 ## 2:  2  6  8
 ## 3:  3  9 12
 ## 4:  4 12 16
 ## 5:  5 15 20
 ## 6:  6 18 24
 ## 7:  7 21 28
 ## 8:  8 24 32
 ## 9:  9 27 36
 ## 10: 10 30 40

 col <- colnames(DT)
 DT[, paste0(c('r1','r2'),rep(col,each=2)):=unlist(lapply(.SD,myfun,z),
                                                   recursive=FALSE),.SDcols=col]
 ## > DT
 ##     z  x  t r1z r2z r1x r2x r1t r2t
 ## 1:  1  3  4   2   0   4   2   5   3
 ## 2:  2  6  8   4   0   8   4  10   6
 ## 3:  3  9 12   6   0  12   6  15   9
 ## 4:  4 12 16   8   0  16   8  20  12
 ## 5:  5 15 20  10   0  20  10  25  15
 ## 6:  6 18 24  12   0  24  12  30  18
 ## 7:  7 21 28  14   0  28  14  35  21
 ## 8:  8 24 32  16   0  32  16  40  24
 ## 9:  9 27 36  18   0  36  18  45  27
 ## 10: 10 30 40  20   0  40  20  50  30
DJJ
  • 2,481
  • 2
  • 28
  • 53
0

In case a function return a matrix you can achieve the same behavior by wrapping the function with one converting the matrix into list first. I wonder if data.table should handle it automatically?

matrix2list <- function(mat){
unlist(apply(mat,2,function(x) list(x)),FALSE)
}

DT <- data.table(A=1:10)

myfun <- function(x) matrix2list(cbind(x+1,x-1))

DT[,c("c","d"):=myfun(A)]

##>DT
##      A  c d
##  1:  1  2 0
##  2:  2  3 1
##  3:  3  4 2
##  4:  4  5 3
##  5:  5  6 4
##  6:  6  7 5
##  7:  7  8 6
##  8:  8  9 7
##  9:  9 10 8
## 10: 10 11 9
DJJ
  • 2,481
  • 2
  • 28
  • 53
-5

Why not have your function take in a data frame and return a data frame directly?

myfun <- function (DT) 
{
DT$ret1 = with(DT, y + v)
DT$ret2 = with(DT, y - v)
return(DT)
}
Etienne Low-Décarie
  • 13,063
  • 17
  • 65
  • 87
  • 26
    Because that copies the whole of `DT`, twice. Florian is using `data.table` for its memory efficiency with large datasets; it doesn't copy `x`,`y` or `v` at all, even once. – Matt Dowle Jul 05 '12 at 08:55