46

data.table is a fantastic R package and I am using it in a library I am developing. So far all is going very well, except for one complication. It seems to be much more difficult (compared to the conventional data frames) to refer to data.table columns using names saved in variables (as for data frames would be, for example: colname="col"; df[df[,colname]<5,colname]=0).

Perhaps what complicates the things most is the apparent lack of consistency of syntax on this in data.table. In some cases, eval(colname) and get(colname), or even c(colname) seem to work. In others, DT[,colname, with=F] is the solution. Yet in others, such as, for example, the set() and subset() functions, I haven't found a solution at all. Finally, an extreme, albeit also quite common use case was discussed earlier (passing column names to data.table programmatically) and the proposed solutions, albeit apparently doing their job, did not seem particularly readable...

Perhaps I am complicating things too much? If anyone could jot down a quick cheatsheet for referring to data.table column names using variables for different common scenarios, I would be very grateful.

UPDATE:

Some specific examples that work provided I can hard code column names:

x.short = subset(x, abs(dist)<=100)
set(x, which(x$val<10), "val", 0) 

Now assume distcol="dist", valcol="val". What is the best way to do the above using distcol and valcol, but not dist and val?

Community
  • 1
  • 1
msp
  • 1,059
  • 1
  • 9
  • 17
  • This question seems too unfocussed. Could be improved if you offered specific test cases. – IRTFM May 17 '13 at 21:20
  • OK, will transfer the examples from the discussion below to the question itself. – msp May 17 '13 at 21:28

5 Answers5

29

If you are going to be doing complicated operations inside your j expressions, you should probably use eval and quote. One problem with that in current version of data.table is that the environment of eval is not always correctly processed - eval and quote in data.table (Note: There has been an update to that answer based on an update to the package.) - and the current fix for that is to add .SD to eval. As far as I can tell from a few tests that I've run this doesn't affect speed (the way e.g. having .SD[1] in j would).

Interestingly this issue only plagues the j and you'll be fine using eval normally in i (where .SD is not available anyway).

The other problem is assignment, and there you have to have strings. I know one way to extract the string name from a quoted expression - it's not pretty, but it works. Here's an example combining everything together:

x = data.table(dist = c(1:10), val = c(1:10))
distcol = quote(dist)
valcol = quote(val)

x[eval(valcol) < 5,
  capture.output(str(distcol, give.head = F)) := eval(distcol)*sum(eval(distcol, .SD))]

Note how I was ok not adding .SD in one eval(distcol), but won't be if I take it out of the other eval.

Another option is to use get:

diststr = "dist"
valstr = "val"

x[get(valstr) < 5, c(diststr) := get(diststr)*sum(get(diststr))]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Thanks a lot for the explanation. However as I mentioned in my comment to @Frank above, my problem is with using `quote(dist)` and `quote(val)`. In my case I don't have access to these, only to variables `distcol="dist"` and `valcol="val"`. Is there a solution for this? – msp May 17 '13 at 21:37
  • 1
    You can go the other way, I guess, from string to `quote`-thing. – Frank May 17 '13 at 21:40
  • @msp added a version for that - it works for this simple example, but expect trouble ahead :) – eddi May 17 '13 at 21:41
  • Note that if `valcol="val"` rather than `valcol=quote(val)`, `x[eval(valcol) < 5]` no longer works. – msp May 17 '13 at 21:42
  • @msp - yep, and it shouldn't - see `?quote` and `?eval` – eddi May 17 '13 at 21:44
  • Wow! Do you have a link to where you discovered about `capture.output(str(distcol, give.head = F))`? – James Hirschorn Oct 02 '19 at 06:14
11

Maybe you know about this solution already?

DT[[colname]]

This is inspired by @eddi's solution in the comments below, using the OP's example:

set.seed(1)
x = data.table(a = 1:10, b=rnorm(10))
colstr="b"
col <- eval(parse(text=paste("quote(",colstr,")",sep="")))
x[eval(col)<0]
x[eval(col)<0,c(colstr):=-100]
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Thanks, I do actually, but it doesn't always work - `set()` and `subset()` being examples - and what I don't know if when it does and when it doesn't... – msp May 17 '13 at 20:31
  • 1
    Okay, `set` and `subset` are outside of what I normally use. I know there are some questions on SO about using `quote` and `eval`; those might help... – Frank May 17 '13 at 20:36
  • I wonder what you use instead of these functions - perhaps I should do too. For example, in these cases: `x.short = subset(x, abs(dist)<=100); set(x, which(x$val<10), "val", 0)`. Assume `distcol="dist", valcol="val"`. Would you know how to plug them in? Thanks! – msp May 17 '13 at 20:42
  • @msp - can you give an example of what you want to do with `set` or `subset` (it's ok if it doesn't work, I just don't quite understand what you want)? – eddi May 17 '13 at 20:45
  • 2
    @msp for `j` the trick is to add `.SD` to `eval` - http://stackoverflow.com/questions/15913832/eval-and-quote-in-data-table so you can do smth like `distcol = quote(dist); x[, abs(eval(distcol, .SD)) <= 100]`; for `i` in your example it should just work - `valcol = quote(val); x[eval(valcol) < 10, val := 0]` – eddi May 17 '13 at 20:54
  • 2
    it should also be clear from above that I think you should use `quote`'d names instead of strings if you are going to be doing complicated operations (you can get away with strings for the easy stuff, but it'll become increasingly difficult with more complex expressions) – eddi May 17 '13 at 21:00
  • Thanks eddi, very helpful - why don't you write it as a proper reply, so I could accept it? – msp May 17 '13 at 21:03
  • @eddi That works for me when using `[` -- which one can use instead of `subset` in most (all?) cases -- though I don't even know where to start with `set`. Will you make it an answer? – Frank May 17 '13 at 21:03
  • And so am i right that for `:=` in `j` it would work like this: `valcol = quote(val); x[eval(valcol) < 10, eval(valcol, .SD) := 0]`. A stupid question perhaps, but I heard that using `.SD` in some other cases slows things down quite a bit. Will it do in this case? Obviously I'm not subsetting `.SD` or anything here, but would referring to it as such not affect the speed? – msp May 17 '13 at 21:06
  • @eddi In fact there is a problem with your suggestion: – msp May 17 '13 at 21:12
  • @eddi Assume `x = data.table(a = 1:10, b=rnorm(10)); colname="b"`. What would be the equivalent of `subset(x, b<0)` and `set(x, which(x$b<0), "b", -100)`? `x[eval(a)<0,eval(a,.SD):=-100]` doesn't do anything. – msp May 17 '13 at 21:20
  • @msp I've added what I think eddi's answer might look like above. – Frank May 17 '13 at 21:28
  • @Frank Thanks, but the problem I now have is with `quote(b)`. Assume I can't hardcode `b`, I only get access to `colstr`. Is there a way round? Will now study eddi's answer below as well. – msp May 17 '13 at 21:34
  • @msp I've added a messy way of passing "b" as a string and making a `quote(b)` out of it. :) – Frank May 17 '13 at 21:39
  • It's actually quite inconceivable to me why `x[eval(col)<0]` in your example works, while `x[eval(colstr)<0]` returns an empty data.table. – msp May 17 '13 at 21:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/30154/discussion-between-frank-and-msp) – Frank May 17 '13 at 21:42
  • 2
    I like the conversion to a quoted expression :) – eddi May 17 '13 at 21:46
  • @Frank, sorry I didn't notice your update for the quoted expression. Looks pretty scary, but, more importantly, it works :) – msp May 17 '13 at 22:04
4

Say you have the column name in variable x, you could do

colname = as.name(x)

you can then use colname in the subset function

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
saiteja
  • 488
  • 3
  • 12
  • 1
    This did not work for me in a data.table. Maybe it does in the subset function which you mention? If it works in data.tables as well, could you give an example? –  Oct 07 '14 at 16:52
2

Another neat solution is to rename your column to something fixed, do your operations on the fixed name column, and rename back. Avoids all the unreadable and unrememberable code.

setnames(dt, colname, "fixed_")
dt[,fixed_:=paste0("foo-",fixed_)]
setnames(dt, "fixed_", colname)
isthisthat
  • 143
  • 1
  • 10
1

eval is definitely not a recommended approach to subset a data.table using dynamically saved variables. The following example will help:

# Toy data.table example
DT = data.table(a = c(1,2,3), b = c(4,5,6))

# Saved variable
mVar <- "a"

# Subset
DT[DT[[mVar]] < 2]

eval is very sensitive to complex character expressions and generally not recommended for production code.

mammask
  • 285
  • 1
  • 3
  • 10
  • Fwiw, if you control the expression that is substituted in, I think it should be fairly safe, like `ex0 = quote(v < 2); ex = do.call("substitute", list(ex0, list(v = as.name(mVar)))); DT[eval(ex)]` If someone passes mVar that would be evaluated as code rather than a name/symbol, I expect `as.name` would fail. Such evaluation is needed to take advantage of auto-indexing, GForce and other data.table optimizations. – Frank Aug 26 '19 at 22:08