107

How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:

df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
#   col1
# 1    4
# 2    5
# 3    6

How can I perform this same operation for a data.table, either with or without := notation? The obvious thing of dt[ , list(colname)] doesn't work (nor did I expect it to).

Henrik
  • 65,555
  • 14
  • 143
  • 159
frankc
  • 11,290
  • 4
  • 32
  • 49

6 Answers6

149

Two ways to programmatically select variable(s):

  1. with = FALSE:

     DT = data.table(col1 = 1:3)
     colname = "col1"
     DT[, colname, with = FALSE] 
     #    col1
     # 1:    1
     # 2:    2
     # 3:    3
    
  2. 'dot dot' (..) prefix:

     DT[, ..colname]    
     #    col1
     # 1:    1
     # 2:    2
     # 3:    3
    

For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).

To assign to variable(s), wrap the LHS of := in parentheses:

DT[, (colname) := 4:6]    
#    col1
# 1:    4
# 2:    5
# 3:    6

The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i was present, it would subassign by reference. The parens around (colname) is a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:

Using with = FALSE with := is now deprecated in all cases, given that wrapping the LHS of := with parentheses has been preferred for some time.

colVar = "col1"
DT[, (colVar) := 1]                             # please change to this
DT[, c("col1", "col2") := 1]                    # no change
DT[, 2:4 := 1]                                  # no change
DT[, c("col1","col2") := list(sum(a), mean(b))]  # no change
DT[, `:=`(...), by = ...]                       # no change

See also Details section in ?`:=`:

DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol

And to answer further question in comment, here's one way (as usual there are many ways) :

DT[, colname := cumsum(get(colname)), with = FALSE]
#    col1
# 1:    4
# 2:    9
# 3:   15 

or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :

expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"

eval(parse(text=expr))
#    col1
# 1:    4
# 2:   13
# 3:   28

If you do that a lot, you can define a helper function EVAL :

EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))

EVAL("DT[,",colname,":=cumsum(",colname,")]")
#    col1
# 1:    4
# 2:   17
# 3:   45

Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.

Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.

set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
#    col1
# 1:    4
# 2:   21
# 3:   66
s_baldur
  • 29,441
  • 4
  • 36
  • 69
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 1
    Thanks for the reply Matthew. The with=FALSE definitely solves part of my problem. In reality though, I want to replace the column with the cumsum of the column. Can I reference the column name by variable on the right-hand side of the assignment somehow? – frankc Sep 12 '12 at 16:48
  • Acutally, I just storded the cumsum externally with a different name that doesn't exist inside the dt and that works fine. – frankc Sep 12 '12 at 17:18
  • 1
    But that would be whole extra line! Not very elegant :) But ok sometimes it's useful. In those cases best to start the variable name with `.`, or `..` to avoid any potential masking if `DT` ever did contain that symbol as a column name in future (and stick to the convention that column names don't start with `.`). There are some feature requests to make it more robust to scope issues like that, such as adding `.()` and `..()`. – Matt Dowle Sep 12 '12 at 17:27
  • I replied before I noticed you edited your answer. My first thought had been eval(parse()) but for some reason I was having trouble getting it to work, when it dawned on me to just do it externally. This is a great answer with lots of things I didn't think about. Thanks for data.table in general, it's a great package. – frankc Sep 12 '12 at 17:32
  • 2
    Note that you could use the quasi-perl type string interpolation of `fn$` from the gsubfn package to improve the readability of the EVAL solution: `library(gsubfn); fn$EVAL( "DT[,$colname:=cumsum($colname)]" )` . – G. Grothendieck Jan 14 '13 at 14:11
  • From `?':='` it seems like this answer needs an update: __________ "`DT[i,colnamevector:=value,with=FALSE]` # old syntax. The contents of colnamevector in calling scope determine the column names or positions to update (or add) _____________________________ `DT[i,(colnamevector):=value]` # same, shorthand. Now preferred. The parens are enough to stop the LHS being a symbol" – jan-glx Aug 18 '15 at 08:57
  • @MattDowle, Apologies to reopen a very old question, but could you also clarify how to reference a column of a data table if the column number (which is also its name) is stored in an integer variable (which could be another column of the same data table)? I tried to pass the variable inside (), inside .() and inside c() ... also tried to convert from integer to character but it always returns the numbers themselves. But if we just hardcode the numbers like `DT[,c("5","10")]` it outputs the columns named 5 and 10. – Lazarus Thurston Mar 18 '18 at 07:55
  • @SanjayMehrotra Not quite following (please ask a new question with reproducible example) but maybe first item in [NEWS for v1.10.2](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#changes-in-v1102--on-cran-31-jan-2017) helps. – Matt Dowle Mar 19 '18 at 17:39
  • I have not been able to assign a column to another column such as DT[, (colVar) := DT2[,(colVar)] ]. How can this be achieved? – Herman Toothrot Jan 02 '20 at 23:05
  • how does it work if you only want to use one element in a vector such as colnamevector(i) in a loop for example? – Herman Toothrot Jun 10 '20 at 14:20
  • @HermanToothrot New questions please with reproducible examples. Thanks. – Matt Dowle Jun 11 '20 at 17:36
  • Is there a way to assign a column name for a "temporary output column" (`=` )? Both `(colvar)` and `..colvar` fail as does `get()` on LHS. I do not want to modify the source DT with `:=`. –  Aug 13 '21 at 16:24
  • @JarnoP. By fail do you mean it works but it modifies DT? Use `copy()` first if you want modifications to be temporary. – Matt Dowle Aug 15 '21 at 00:21
  • Hi @MattDowle, I use chaining a lot to run different analyses of the same data set. I do not want to modify the data set when running the analyses. The chaining works (creates on the fly a temporary version in every case except when I need to (temporarily) assign values to a column named with variable. I am running close to memory limits so making a copy of the source data set is not an option. Of course I can break down the chain into two parts and do `copy()`. It just makes the code more complicated. Anyways, thank you Matt and other contributors for a simply awesome library! –  Aug 16 '21 at 09:06
  • @MattDowle, what is the reason `get(var)` fails when used on LHS in temporary assignments? From syntax perspective both the `get(var)` and `..var` "should" work fine (i.e. there is no other reserved use for those at LHS). This is all about compactness of the code. At the moment I am resorting to using fixed temp names and then using `setnames()`. It "works", but it is not elegant at all. –  Sep 09 '21 at 06:18
9

*This is not an answer really, but I don't have enough street cred to post comments :/

Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to it's performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?

colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))

Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.

efh0888
  • 328
  • 3
  • 12
  • +1 for just working: I agree this must not be "the way" to do this, but having just spent like 45 minutes pouring over every SO post on this subject, this is the only solution that I have actually been able to get to work - thanks for taking the time to point it out! – neuropsych Jan 24 '16 at 21:39
  • Glad I could help! Unfortunately, I never did find a more elegant solution directly using data.tables, although this 3 liner isn't terrible. In my scenario, I did realize a simpler alternative would have been to use tidyr to just make my data "long" instead of "wide", since based on user input, I could always filter on a single column rather than selecting from a set of columns. – efh0888 Jan 25 '16 at 21:27
  • 2
    It's not safe to assume `V1` is the new name. For example, if you read csv with `fread` and there is a unnamed column, it will has`V1` name (and `read.csv` will give `X`). So it's possible your table already have a `V1`. Maybe just get the name by `names(DT)[length(names(DT))]` – dracodoc Aug 10 '16 at 19:45
5

Retrieve multiple columns from data.table via variable or function:

library(data.table)

x <- data.table(this=1:2,that=1:2,whatever=1:2)

# === explicit call
x[, .(that, whatever)]
x[, c('that', 'whatever')]

# === indirect via  variable
# ... direct assignment
mycols <- c('that','whatever')
# ... same as result of a function call
mycols <- grep('a', colnames(x), value=TRUE)

x[, ..mycols]
x[, .SD, .SDcols=mycols]

# === direct 1-liner usage
x[, .SD, .SDcols=c('that','whatever')]
x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]

which all yield

   that whatever
1:    1        1
2:    2        2

I find the .SDcols way the most elegant.

C.K
  • 284
  • 3
  • 5
  • I also think that the `.SDcols` option is the most elegant. However you create a copy of the selected data internally which makes it expansive. `x[, c('that', 'whatever')]` & `x[, ..mycols]` are the fastest (They're doing the same). `x[, .(that, whatever)]` also takes more time, don't know why though. _Checked with `microbenchmark::microbenchmark()` ._ – andschar Mar 27 '23 at 12:24
3

With development version 1.14.3, has gained a new interface for programming on data.table, see item 10 in New Features. It uses the new env = parameter.

library(data.table) # development version 1.14.3 used
dt <- data.table(col1 = 1:3)
colname <- "col1"

dt[, cn := cn + 3L, env = list(cn = colname)][]
    col1
   <int>
1:     4
2:     5
3:     6
Uwe
  • 41,420
  • 11
  • 90
  • 134
2

For multiple columns and a function applied on column values.

When updating the values from a function, the RHS must be a list object, so using a loop on .SD with lapply will do the trick.

The example below converts integer columns to numeric columns

a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class)  # show classes of columns
#         a           b          c1 
# "integer"   "integer" "character" 

# column name character vector
nm <- c("a", "b")

# Convert columns a and b to numeric type
a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]

sapply(a1, class)
#         a           b          c1 
# "numeric"   "numeric" "character" 
Sathish
  • 12,453
  • 3
  • 41
  • 59
2

You could try this:

colname <- as.name("COL_NAME")
DT2 <- DT[, list(COL_SUM=sum(eval(colname, .SD))), by = c(group)]
Erwan
  • 1,385
  • 1
  • 12
  • 22
  • 2
    It's always recommended to add an explanation with your code instead of just posting code. – MBorg Mar 19 '20 at 05:53