What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation specified in a variable.
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,eval(anntot):=
the.table[,rep(eval(varname),each=12)]]
new.table[,eval(monthly):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,eval(rolling):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,eval(scaled):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table
actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table
special objects .I
, .SD
, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote
-ing and eval
-ing.
In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table
and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ...
approach is usually not a workable substitute. Also replacing each call of eval(a.column.name)
with DT[[a.column.name]]
neither simplifies much nor works completely in general since that doesn't play nice with the other data.table
operations, as far as I am aware.
I am aware of many workarounds for various use cases of variable column
names in data.table
, including:
- Select / assign to data.table when variable names are stored in a character vector
- Pass column name in data.table using variable
- Referring to data.table columns by names saved in variables
- passing column names to data.table programmatically
- Data.table meta-programming
- How to write a function that calls a function that calls data.table?
- Using dynamic column names in `data.table`
- Dynamic column names in data.table
- Assign multiple columns using := in data.table, by group
- Setting column name in "group by" operation with data.table
- Summarizing multiple columns with data.table
and probably more I haven't referenced.
But: even if I learned all the tricks documented above to the point that I never had to look them up to remind myself how to use them, I still would find that working with column names that are passed as parameters to a function is an extremely tedious task.