data.table operations by column name

Question

Suppose I have a data.table

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

I can add new columns like this:

a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
   id  a  b sa sb
1:  1 21 11 43 23
2:  1 22 12 43 23
3:  2 23 13 47 27
4:  2 24 14 47 27
5:  3 25 15 25 15

However, suppose that I have column names instead:

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}

what do I do?

Possibly relevant: http://stackoverflow.com/q/16617226/324364 — joran, Jan 09 '14 at 15:51
And a bit older: http://stackoverflow.com/a/12392269/1412059 — Roland, Jan 09 '14 at 15:52
@joran: I don't see how I could use `eval`/`quote` here; I tried `s <- quote(paste0("s..",n) := sum(n)); a[, eval(s), by="id"]` and failed. — sds, Jan 09 '14 at 16:06
@sds I was just brainstorming. I sympathize, though. The level of confusion data.table creates for me with this type of problem is the single biggest reason I don't use it more often. — joran, Jan 09 '14 at 16:10
@MattDowle It's my problem, I'm too dumb. I deal better with syntax designed for stupid people. — joran, Jan 09 '14 at 16:22
@joran Do you have any suggestions? Or even any other way to add/update columns by reference by group? Let alone where those columns are defined in a variable. It's `DT[where, select|update, by]`. It can be restricted to subset of rows just by supplying `where` too. Do you know SQL? — Matt Dowle, Jan 09 '14 at 16:46
@MattDowle: which should I use - your method (`get`) or @eddi's (`lapply`/`.SD`)? (I upvoted both, but I am now torn between them) — sds, Jan 09 '14 at 16:47
@MattDowle I do know quite a lot of SQL. I'm not criticizing, I'm just saying that it pretty routinely makes me feel quite dumb. And the ease with which other people seem to have with it leads me to conclude that the problem is me, not data.table. — joran, Jan 09 '14 at 16:50
@joran I'm British so I read it as sarcasm. Like : https://twitter.com/Cafe_Global/status/417978158470930432/photo/1 — Matt Dowle, Jan 09 '14 at 17:03
@sds Eddi's is the best as it avoids grouping many times. That's the one to accept. Great question btw. — Matt Dowle, Jan 09 '14 at 17:05
@MattDowle: I thought so too, but your version finished in 36 seconds, while his is still running (5+ minutes). The funny thing is that your version suggests his: "'get' found in j. xvars being set to all columns. Use .SDcols or eval(macro) instead. Both will detect the columns used which is important for efficiency." — sds, Jan 09 '14 at 17:09
@sds that's curious - can you post a larger dataset where you see that, when I test on something like `N = 1e7; a = data.table(id = 1:N, a = sample(N), b = sample(N))`, `.SD` version is much faster — eddi, Jan 09 '14 at 17:15
@sds sure, but the bottleneck in these scenarios is large number of groups and not size of `.SD`; anyway, testing on `id = sample(1:1e4, N, T)` again results in `.SD` being several times faster - so I'm curious what the dimensions of your data are and if you can come up with a simple similar example — eddi, Jan 09 '14 at 17:19
@eddi: my `.SDcols` is not the same as all the columns (as opposed to the examples here). my data is `Classes ‘data.table’ and 'data.frame': 3693671 obs. of 10 variables:` — sds, Jan 09 '14 at 17:54
@sds Sounds like we should investigate this further. Can you construct a dataset and show the exact two commands so that we can take a look? — Matt Dowle, Jan 09 '14 at 18:09
@MattDowle: I have the data set. where do you want me to upload it? shouldn't we move this discussion somewhere? (mailing list?) — sds, Jan 09 '14 at 18:21
@sds Haven't tried it myself but how about http://www.filedropper.com/. A new question on S.O. would be best. This question was "how". New question would be something like "why is one of these commands slower", linking to this question. — Matt Dowle, Jan 09 '14 at 18:41
@sds But a randomly generated dataset would be even better, like eddi's example above. — Matt Dowle, Jan 09 '14 at 19:50
@joran, probably you should start by asking questions on things you don't understand here on S.O. or on data.table mailing list then? — Arun, Jan 09 '14 at 23:58

score 11 · Accepted Answer · answered Jan 09 '14 at 16:25

11

You can also do this:

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

a[, c("sa", "sb") := lapply(.SD, sum), by = id]

Or slightly more generally:

cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]

answered Jan 09 '14 at 16:25

eddi

49,088
6
104
155

1

+1 Even better as that does it in a single grouping, rather than a grouping for each column being added one by one. Using `.SD` in this case shouldn't bite, because all the data inside `.SD` is being used. – Matt Dowle Jan 09 '14 at 16:28
@MattDowle as an aside - the main bite from using `.SD` happens when `[.data.table` is involved because of the huge overhead of that function, so something like `.SD[1:.N]` would be orders of magnitude slower in a "by" loop that `.SD` – eddi Jan 09 '14 at 16:36
Oh yes, thanks for reminder. It's on the list to optimize that. – Matt Dowle Jan 09 '14 at 16:48
1

@eddi, just to remind [this](http://stackoverflow.com/a/20460441/559784) discussion as well. It's not just `[.data.table`. It certainly gets things slow. But just evaluating `.SD` also seems to get things slow. I'll make a proper benchmark to test this with data.frame and data.table and post back. – Arun Jan 09 '14 at 19:27

score 8 · Answer 2 · edited May 23 '17 at 12:25

8

This is similar to :

How to generate a linear combination of variables and update table using data.table in a loop call?

but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design design and set() is unlikely to change in that regard.

I sometimes use the EVAL helper at the end of that answer.
https://stackoverflow.com/a/20808573/403310
Some wince at that approach but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it) you can add a print inside your EVAL helper function.

However, in this simple example you can wrap the LHS of := with brackets to tell data.table to lookup the value (clearer than with=FALSE), and the RHS needs a get().

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, (s) := sum(get(n)), by="id"]
}

edited May 23 '17 at 12:25

Community

1
1

answered Jan 09 '14 at 16:04

Matt Dowle

58,872
22
166
224

I don't need the parens around `s`, do I? – sds Jan 09 '14 at 16:08
Yes otherwise it's going to create a column called `"s"` in both iterations of the loop. – Matt Dowle Jan 09 '14 at 16:10
Oh sorry - I missed the `with=FALSE`. We've moved towards `(LHS):=` because that's easier to read. `with=FALSE` is unclear whether it refers to the LHS or the RHS of `:=`. – Matt Dowle Jan 09 '14 at 16:13
That reminds me, we should formally deprecate `with=FALSE` when `:=` is used. – Matt Dowle Jan 09 '14 at 16:14
Right so very little wrong is the question. Just wrapping `n` with `get(n)` was all that was needed. The error was from `sum("a")`. – Matt Dowle Jan 09 '14 at 16:17

score 2 · Answer 3 · edited Feb 15 '20 at 03:48

Edit 2020-02-15 about `..`

data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[ , ..n1] and dt[ , ..n2] in the below

have a look at with in ? data.table:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[, n3 := dt[ , n1, with = FALSE ] * dt[ , n2, with = FALSE ], with = FALSE ]

EDIT:

Or you just change the colnames forth and back:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[ , dt.names["n3"] := 1L, with = FALSE ]

dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
setnames( dt, dt.names, names(dt.names) )

dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )

which works together with by.

score 0 · Answer 4 · answered Jan 09 '14 at 23:30

Here is an approach that does the call mangling and avoids any overhead with .SD

# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create named list of names
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <-  as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])

#    id  a  b sa sb
# 1:  1 21 11 43 23
# 2:  1 22 12 43 23
# 3:  2 23 13 47 27
# 4:  2 24 14 47 27
# 5:  3 25 15 25 15

data.table operations by column name

4 Answers4

Edit 2020-02-15 about ..

Edit 2020-02-15 about `..`