12

Suppose I have a data.table

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

I can add new columns like this:

a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
   id  a  b sa sb
1:  1 21 11 43 23
2:  1 22 12 43 23
3:  2 23 13 47 27
4:  2 24 14 47 27
5:  3 25 15 25 15

However, suppose that I have column names instead:

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}

what do I do?

sds
  • 58,617
  • 29
  • 161
  • 278
  • Possibly relevant: http://stackoverflow.com/q/16617226/324364 – joran Jan 09 '14 at 15:51
  • And a bit older: http://stackoverflow.com/a/12392269/1412059 – Roland Jan 09 '14 at 15:52
  • ...and 1.6 from the data.table FAQ. – joran Jan 09 '14 at 15:53
  • @joran: I don't see how I could use `eval`/`quote` here; I tried `s <- quote(paste0("s..",n) := sum(n)); a[, eval(s), by="id"]` and failed. – sds Jan 09 '14 at 16:06
  • 1
    @sds I was just brainstorming. I sympathize, though. The level of confusion data.table creates for me with this type of problem is the single biggest reason I don't use it more often. – joran Jan 09 '14 at 16:10
  • @joran Then I hope my answer and comments help? – Matt Dowle Jan 09 '14 at 16:18
  • @MattDowle It's my problem, I'm too dumb. I deal better with syntax designed for stupid people. – joran Jan 09 '14 at 16:22
  • @joran Do you have any suggestions? Or even any other way to add/update columns by reference by group? Let alone where those columns are defined in a variable. It's `DT[where, select|update, by]`. It can be restricted to subset of rows just by supplying `where` too. Do you know SQL? – Matt Dowle Jan 09 '14 at 16:46
  • @MattDowle: which should I use - your method (`get`) or @eddi's (`lapply`/`.SD`)? (I upvoted both, but I am now torn between them) – sds Jan 09 '14 at 16:47
  • @MattDowle I do know quite a lot of SQL. I'm not criticizing, I'm just saying that it pretty routinely makes me feel quite dumb. And the ease with which other people seem to have with it leads me to conclude that the problem is me, not data.table. – joran Jan 09 '14 at 16:50
  • @joran I'm British so I read it as sarcasm. Like : https://twitter.com/Cafe_Global/status/417978158470930432/photo/1 – Matt Dowle Jan 09 '14 at 17:03
  • @sds Eddi's is the best as it avoids grouping many times. That's the one to accept. Great question btw. – Matt Dowle Jan 09 '14 at 17:05
  • @MattDowle: I thought so too, but your version finished in 36 seconds, while his is still running (5+ minutes). The funny thing is that your version suggests his: "'get' found in j. xvars being set to all columns. Use .SDcols or eval(macro) instead. Both will detect the columns used which is important for efficiency." – sds Jan 09 '14 at 17:09
  • @sds that's curious - can you post a larger dataset where you see that, when I test on something like `N = 1e7; a = data.table(id = 1:N, a = sample(N), b = sample(N))`, `.SD` version is much faster – eddi Jan 09 '14 at 17:15
  • @eddi: your data set has _trivial_ `.SD` (one line!) – sds Jan 09 '14 at 17:17
  • @sds sure, but the bottleneck in these scenarios is large number of groups and not size of `.SD`; anyway, testing on `id = sample(1:1e4, N, T)` again results in `.SD` being several times faster - so I'm curious what the dimensions of your data are and if you can come up with a simple similar example – eddi Jan 09 '14 at 17:19
  • @eddi: my `.SDcols` is not the same as all the columns (as opposed to the examples here). my data is `Classes ‘data.table’ and 'data.frame': 3693671 obs. of 10 variables:` – sds Jan 09 '14 at 17:54
  • @sds Sounds like we should investigate this further. Can you construct a dataset and show the exact two commands so that we can take a look? – Matt Dowle Jan 09 '14 at 18:09
  • @MattDowle: I have the data set. where do you want me to upload it? shouldn't we move this discussion somewhere? (mailing list?) – sds Jan 09 '14 at 18:21
  • @sds Haven't tried it myself but how about http://www.filedropper.com/. A new question on S.O. would be best. This question was "how". New question would be something like "why is one of these commands slower", linking to this question. – Matt Dowle Jan 09 '14 at 18:41
  • @sds But a randomly generated dataset would be even better, like eddi's example above. – Matt Dowle Jan 09 '14 at 19:50
  • @joran, probably you should start by asking questions on things you don't understand here on S.O. or on data.table mailing list then? – Arun Jan 09 '14 at 23:58

4 Answers4

11

You can also do this:

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

a[, c("sa", "sb") := lapply(.SD, sum), by = id]

Or slightly more generally:

cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 1
    +1 Even better as that does it in a single grouping, rather than a grouping for each column being added one by one. Using `.SD` in this case shouldn't bite, because all the data inside `.SD` is being used. – Matt Dowle Jan 09 '14 at 16:28
  • @MattDowle as an aside - the main bite from using `.SD` happens when `[.data.table` is involved because of the huge overhead of that function, so something like `.SD[1:.N]` would be orders of magnitude slower in a "by" loop that `.SD` – eddi Jan 09 '14 at 16:36
  • Oh yes, thanks for reminder. It's on the list to optimize that. – Matt Dowle Jan 09 '14 at 16:48
  • 1
    @eddi, just to remind [this](http://stackoverflow.com/a/20460441/559784) discussion as well. It's not just `[.data.table`. It certainly gets things slow. But just evaluating `.SD` also seems to get things slow. I'll make a proper benchmark to test this with data.frame and data.table and post back. – Arun Jan 09 '14 at 19:27
8

This is similar to :

How to generate a linear combination of variables and update table using data.table in a loop call?

but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design design and set() is unlikely to change in that regard.

I sometimes use the EVAL helper at the end of that answer.
https://stackoverflow.com/a/20808573/403310
Some wince at that approach but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it) you can add a print inside your EVAL helper function.

However, in this simple example you can wrap the LHS of := with brackets to tell data.table to lookup the value (clearer than with=FALSE), and the RHS needs a get().

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, (s) := sum(get(n)), by="id"]
}
Community
  • 1
  • 1
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
2

Edit 2020-02-15 about ..

data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[ , ..n1] and dt[ , ..n2] in the below


have a look at with in ? data.table:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[, n3 := dt[ , n1, with = FALSE ] * dt[ , n2, with = FALSE ], with = FALSE ]

EDIT:

Or you just change the colnames forth and back:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[ , dt.names["n3"] := 1L, with = FALSE ]

dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
setnames( dt, dt.names, names(dt.names) )

dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )

which works together with by.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Beasterfield
  • 7,023
  • 2
  • 38
  • 47
0

Here is an approach that does the call mangling and avoids any overhead with .SD

# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create named list of names
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <-  as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])

#    id  a  b sa sb
# 1:  1 21 11 43 23
# 2:  1 22 12 43 23
# 3:  2 23 13 47 27
# 4:  2 24 14 47 27
# 5:  3 25 15 25 15
mnel
  • 113,303
  • 27
  • 265
  • 254