11

Apologies for a question that probably makes it obvious that I usually work in Python/pandas, but I'm stuck with this. How do I select a data.table column using a string?

dt$"string"
dt$as.name("string")
dt$get("string")

I'm sure this is super simple, but I'm not getting it. Any help is greatly appreciated!


-------------- EDITED TO ADD ----------------------

After some of the helpful comments and tips below, I think I've narrowed down the problem a bit and have a reproducible example. Consider:

dt = data.table(ID = c("a","a","a","b","b","b"), col1=rnorm(6), col2=rnorm(6)*100)

And assume we want to assign the values in col2 to col1. As I've learned below, the data.table syntax for this would be dt[,col1:=col2], clean and simple. The problems start when one (or both) of the variables in the j argument are strings. I found the following:

dt[, "col1":=col2] works as expected

dt[, "col1":="col2"] fails as expected (tries to assign the character col2 to the double vector col1

dt[, "col1":=get("col2")] works as expected

dt[, get("col1")] returns col1 as expected

but: dt[, get("col1"):=col2] or any other assignment fails.

Some context: the reason for doing this is that I'm constructing strings in a loop, to access a larger number of columns that are all named colname_colnumber, i.e. I loop over colname and colnumber to then access column paste0(colname,colnumber).

Nils Gudat
  • 13,222
  • 3
  • 39
  • 60
  • 2
    Btw, `dt$"string"` should work perfectly fine, what exactly is the error? Though expressions on RHS of `$` will never work, see [this faq](http://stackoverflow.com/questions/18222286/select-a-data-frame-column-using-and-the-name-of-the-column-in-a-variable) – David Arenburg May 31 '15 at 21:04
  • You do have a point - I think I got the error as I was constructing the string inside a loop, so used something like `$paste(x,y)` and then forgot to try without using an expression before asking the question! – Nils Gudat May 31 '15 at 21:13
  • 1
    It's still not quite clear why the `dt[,"col1":=get("col2")]` option isn't sufficient for your purposes, e.g. `for (ii in 1:2) dt[,paste0("col",ii):=get("col2")]` seems to be what you want. Perhaps you need to expand the scope of your example a bit. – MichaelChirico Jun 02 '15 at 13:04
  • Point taken - the option you're mentioning would work (and that's who I'll do it). The problem occured because I stored the constructed string first, as I'm using it in other places as well. If I then do `var<-paste0("col","1")`, I'd have to rely on the `get()` method again, which doesn't work. So while your suggestions works in my case, I'd still like to understand why the behaviour of `data.table` changes between the simple `get()` call and the `get()` call with assigmnent. – Nils Gudat Jun 02 '15 at 13:32
  • @Nilsgudat its a fair question. I'm not at a computer but my instincts tell me that once you use `get` the assignment option is no longer available, as a vector is already returned. you can assign a vector to a column name, but you can't assign a vector of names to a vector (without being more careful) – MichaelChirico Jun 03 '15 at 05:38

3 Answers3

16

You can use get() as the j argument using single brackets:

library(data.table)
dt <- data.table(iris)
dt[, get("Species")]

The result:

[1] setosa     setosa     setosa     setosa     setosa     setosa .....

You can also use a string directly inside the double bracket operator, like this:

dt[["Species"]]
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • `get()` doest the trick, but for some reason it doesn't work in assignment operations. What am I doing wrong? Say I have `var1 <- "string1"` and `var2 <- "string2", where my data table has columns named `string1` and `string2`. `dt[, get(var1)]` and `dt[,get(var2)]` work as expected, but if I want to assign the values in column 2 to column 1 using `dt[,get(var1)] <- dt[,get(var2)]` I get `Error in get(var1): object 'string1' not found`. Any ideas what's going on here? – Nils Gudat Jun 01 '15 at 17:00
  • Or even simpler: `dt[, get(var1)]` works, while `dt[,get(var1)] <- 0` throws the `object not found` error. – Nils Gudat Jun 01 '15 at 17:17
  • 5
    You should avoid `<-` assignment in `data.table` as much as possible. Try `dt[,(var1):=var2]`, or `dt[,(var1):=0]` for your other example. I recommend the [starter vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started) for some very readable examples of how to do the basics. – MichaelChirico Jun 01 '15 at 18:32
  • That syntax is definitely a lot cleaner, but it still won't work - `dt[,get(var1):=0]` throws the same error. I can't think of any reason why a column would be accessible with the simple call but unavailable in an assignment call, but I'll definitely read through the document you linked! – Nils Gudat Jun 01 '15 at 21:18
  • @NilsGudat Hmm in that case I don't know what to tell you. If the problem persists after you've "done your reading", please edit a reproducible example into your post. I'm not sure why you're focused on using `get`, as there are several alternatives that should work, but perhaps some more context is necessary--if so, add it to your post. – MichaelChirico Jun 01 '15 at 22:04
  • Sure, I see that my question isn't great in the reproducibility department - I'm creating variable names in a loop concatenating strings, so I need `get` to actually access the variable, and indeed it appears that `get` is causing the problem, as `dt[,string1 := 0] works without problems. I'll give it some more thought and edit my question as necessary. – Nils Gudat Jun 01 '15 at 22:12
7

I'll add that if you want a bunch of columns, you may wish to use something like:

dt[ , c("id", paste0("col", 1:10)), with = FALSE]

As @Arun adds below, other options for getting multiple columns are:

dt[ , mget(c("id", paste0("col", 1:5)))]

and

dt[ , .SD, .SDcols = c("id", paste0("col", 1:5))]

In recent versions of data.table (e.g. current CRAN) you can also use the "up-a-level" notation like:

keep_cols = c('id', paste0('col', 1:5))
dt[ , ..keep_cols]

For reference, mget seems to be very slow; .SDcols is fastest, but competitive with with = FALSE; I personally find all to be useful/most natural in different situations.

Here's a simple benchmark:

NN <- 10000L
MM <- 100L
mm <- 10L

DT = data.table(id = 1:NN)
DT[ , paste0("col", 1:MM) := lapply(integer(MM), function(x) runif(NN))]

sdcols = function(...) DT[ , .SD, .SDcols = paste0("col", sample(MM, size = mm))]
m.get = function(...) DT[ , mget(paste0("col", sample(MM, size=mm)))]
withF = function(...) DT[ , paste0("col", sample(MM, size = mm)), with = FALSE]

library(microbenchmark)
microbenchmark(times=100L, sdcols(), m.get(), withF())
# Unit: microseconds
#      expr      min        lq      mean    median        uq      max neval cld
#  sdcols()  780.201  810.4350  865.3564  827.4970  853.4875 2354.577   100 a  
#   m.get() 2792.293 2864.1225 3052.3872 2899.9370 3031.9260 4831.963   100   c
#   withF()  897.822  927.7105 1005.3166  945.9495  981.0580 2600.445   100  b 
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
5

You can do assignments without get but using brackets:

dt[, ("col1"):=col2]

instead of:

dt[, get("col1"):=col2]

See for more explanation: Select / assign to data.table variables which names are stored in a character vector

Community
  • 1
  • 1
Evertvdw
  • 827
  • 10
  • 17