18

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:

     Time     Stock x1 x2 x3
1: 2014-08-22     A 15 27 34
2: 2014-08-23     A 39 44 29
3: 2014-08-24     A 20 50  5
4: 2014-08-22     B 42 22 43
5: 2014-08-23     B 44 45 12
6: 2014-08-24     B  3 21  2

Now I want to scale and sum many of the variables to get an output like:

         Time Stock x1 x2 x3   x2_scale   x3_scale x2_sum x3_sum
1: 2014-08-22     A 15 27 34 -1.1175975  0.7310560    121     68
2: 2014-08-23     A 39 44 29  0.3073393  0.4085313    121     68
3: 2014-08-24     A 20 50  5  0.8102582 -1.1395873    121     68
4: 2014-08-22     B 42 22 43 -0.5401315  1.1226726     88     57
5: 2014-08-23     B 44 45 12  1.1539172 -0.3274462     88     57
6: 2014-08-24     B  3 21  2 -0.6137858 -0.7952265     88     57

A brute force implementation of my problem would be:

library(data.table)

set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
                Stock = rep(LETTERS[1:2], each=3 ),
                x1 = sample(1:50, 6),
                x2 = sample(1:50, 6),
                x3 = sample(1:50, 6))

d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]

Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:

  d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
  d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]

But again, this would get very messy with a lot of variables and also this brings up an error message with scale (but not with sum since this isn't returning a vector).

Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?

Community
  • 1
  • 1
Pilik
  • 689
  • 1
  • 7
  • 18

4 Answers4

20

I think with a small modification to your last code you can easily do both for as many variables you want

vars <- c("x2", "x3") # <- Choose the variable you want to operate on

d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]

##          Time Stock x1 x2 x3   x2_scale   x3_scale x2_sum x3_sum
## 1: 2014-08-22     A 13 14 32 -1.1338934  1.1323092     87     44
## 2: 2014-08-23     A 25 39  9  0.7559289 -0.3701780     87     44
## 3: 2014-08-24     A 18 34  3  0.3779645 -0.7621312     87     44
## 4: 2014-08-22     B 44  8  6 -0.4730162 -0.7258662     59     32
## 5: 2014-08-23     B 49  3 18 -0.6757374  1.1406469     59     32
## 6: 2014-08-24     B 15 48  8  1.1487535 -0.4147807     59     32

For simple functions (that don't need special treatment like scale) you could easily do something like

vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
  d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock] 
}
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks everyone for the clever solutions! I accepted @DavidArenburg 's answer because it only uses `data.table`and because it's quite similar to my original code. – Pilik Aug 29 '14 at 09:44
4

Another variation using data.table

  vars <- c("x2", "x3")
  d[,  paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
               lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
   d
   #        Time Stock x1 x2 x3   x2_scale x2_sum   x3_scale x3_sum
  #1: 2014-08-22     A 15 27 34 -1.1175975    121  0.7310560     68
  #2: 2014-08-23     A 39 44 29  0.3073393    121  0.4085313     68
  #3: 2014-08-24     A 20 50  5  0.8102582    121 -1.1395873     68
  #4: 2014-08-22     B 42 22 43 -0.5401315     88  1.1226726     57
  #5: 2014-08-23     B 44 45 12  1.1539172     88 -0.3274462     57
  #6: 2014-08-24     B  3 21  2 -0.6137858     88 -0.7952265     57

Based on comments from @Arun, you could also do:

   cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
    d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))), 
                              rec=F), by=Stock, .SDcols=vars]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 2
    Or `d[, (cols) := unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))), rec=FALSE), by=Stock, .SDcols=vars]` - we just have to get a list in `j`, where each element of the list will become a column. – Arun Aug 22 '14 at 11:45
3

You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). Then, from dplyr you could use the function mutate_all as I do in this example here (with the first data set you showed in your question):

library(dplyr)
dt %>%
  group_by(Stock) %>%
  mutate_all(funs(sum, scale), x2, x3)
#Source: local data table [6 x 9]
#Groups: Stock
#
#        Time Stock x1 x2 x3 x2_sum x3_sum   x2_scale   x3_scale
#1 2014-08-22     A 15 27 34    121     68 -1.1175975  0.7310560
#2 2014-08-23     A 39 44 29    121     68  0.3073393  0.4085313
#3 2014-08-24     A 20 50  5    121     68  0.8102582 -1.1395873
#4 2014-08-22     B 42 22 43     88     57 -0.5401315  1.1226726
#5 2014-08-23     B 44 45 12     88     57  1.1539172 -0.3274462
#6 2014-08-24     B  3 21  2     88     57 -0.6137858 -0.7952265

You can easily add more functions to be calculated which will create more columns for you. Note that mutate_all applies the function to each column except the grouping variable (Stock) by default. But you can either specify the columns you only want to apply the functions to (which I did in this example) or you can specify which columns you don't want to apply the functions to (that would be, e.g. -c(x2,x3) instead of where I wrote x2, x3).

EDIT: replaced mutate_each above with mutate_all as mutate_each will be deprecated in the near future.

Jaap
  • 81,064
  • 34
  • 182
  • 193
talat
  • 68,970
  • 21
  • 126
  • 157
2

EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.

library(functional)
funs <- list(scale=Compose(scale, c), sum=sum)    # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")

d[, 
  c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), rec=F),
  .SDcols=cols,
  by=Stock
]

Produces:

         Time Stock x1 x2 x3   x2_scale   x3_scale x2_sum x3_sum
1: 2014-08-22     A 15 27 34 -1.1175975  0.7310560    121     68
2: 2014-08-23     A 39 44 29  0.3073393  0.4085313    121     68
3: 2014-08-24     A 20 50  5  0.8102582 -1.1395873    121     68
4: 2014-08-22     B 42 22 43 -0.5401315  1.1226726     88     57
5: 2014-08-23     B 44 45 12  1.1539172 -0.3274462     88     57
6: 2014-08-24     B  3 21  2 -0.6137858 -0.7952265     88     57
BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • What variable specifications are you talking about? It's obviously easy to have 4 functions do 4 common tasks, and move everything else to `do`. – Arun Aug 22 '14 at 12:12
  • @Arun, things like this, where you want to assign to columns that are not known ahead of time, operating on other columns that are not known ahead of time (i.e. the specs come in in variables). – BrodieG Aug 22 '14 at 12:12
  • @Arun, though if you have a better way of doing something like above that is fully generic I'd love to see it. This is the best I could come up with on a short time frame, and while it works it is a bit convoluted (not to mention it uses `data.frame`...). – BrodieG Aug 22 '14 at 12:14
  • data.table already generates name automatically for aggregations. It expects names for `:=`. Is this what you mean as "weak area"? If so, why not file an issue for a FR? But that's not the problem here. The problem is using many aggregation functions within `lapply(.)`. – Arun Aug 22 '14 at 12:17
  • I've already provided a solution under Akrun's answer. But thanks, we'll see if this can be improved. – Arun Aug 22 '14 at 12:17
  • @Arun, I think the one additional piece I'd want to see is the ability to specify a variable number of functions as functions. In your answer the function **calls** are specified explicitly. This version here allows you to add as many functions as you wish, along the lines of `funs(scale, sum)`. – BrodieG Aug 22 '14 at 12:20
  • @Arun, Re: FR, you're right, I should do that instead of belly-ache about it. I'll think about the best way to submit it. – BrodieG Aug 22 '14 at 12:20
  • BrodieG, what's there to think about? Do you've a github account? Then just file an issue here: https://github.com/Rdatatable/data.table/issues on whatever you think could improve the functionality. We're pretty open to suggestions. – Arun Aug 22 '14 at 12:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/59809/discussion-between-brodieg-and-arun). – BrodieG Aug 22 '14 at 12:22