One problem I often have with data.table and ggplot is their use in a for loop, in which I iterate over a set of column names.

Take this data table as an example:

library(data.table)

dt <- data.table(values1=rep(c(1,2),each=2),
                 values2=rep(c(10,20),each=2),
                 notthis=0,
                 category=rep(c('a','b'),each=2))
##
##    values1 values2 notthis category
## 1:       1      10       0        a
## 2:       1      10       0        a
## 3:       2      20       0        b
## 4:       2      20       0        b

Let's say I want to iterate over all columns of dt except notthis and category. For each column I want to plot two histograms of its values, one per category, and add a vertical line at each category's mean (possibly sending the plots to a pdf device using pdf, print, dev.off).

A sketch of the idea could be as follows:

loopnames <- setdiff(colnames(dt), c('notthis', 'category'))
## [1] "values1" "values2"

for(ZZZ in loopnames){
    dtmeans <- dt[, .(means=mean(ZZZ)), by=category]

    ggplot(dt) + geom_histogram(aes(x=ZZZ, fill=category)) +
                 geom_vline(data=dtmeans, aes(xintercept=means, color=category))
}

but obviously it doesn't work. Use of the ZZZ variable produces errors in data.table and ggplot.

Note the reasons behind some lines of the code:

  • I want to build a list of the columns to iterate through, defined by difference: dt could have hundreds of columns and I only want to exclude, say, two of them.
  • I need to construct a data table, containing the means, to pass to geom_vline (a data table for this is overkill in my opinion, but hey that's what ggplot wants).
  • I'd like to use the special syntax of data.table to construct such a data table.

Consulting the useful answers to this post, this post, and this post, I've tried various combinations to make the code idea above work: using with=FALSE for the data table, the quote()/eval() pair, the "unquote" !! operator, as well as as.name() and sym(). But no combination worked. The closest I got was the quote()/eval() pair, which seems to work for both data.table and ggplot, but I didn't manage to use this workaround in a for loop.

Can you suggest a general way without using tidyverse commands to deal with variable/looped column names in packages such as data.table and ggplot?

pglpm

1 Answer

Try using get(ZZZ) instead of ZZZ in the loop body, both in the data.table call and inside aes().
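A minimal sketch of how this could look for the loop in the question, assuming data.table and ggplot2 are installed. The names `outfile` and `means_by_col` are introduced here only for illustration, and `bins = 30` and `xlab(ZZZ)` are optional additions (without `xlab`, the axis would be labelled with the literal expression). Each plot is wrapped in print() because a ggplot object is not drawn automatically inside a for loop, and printing right away also means get(ZZZ) is evaluated while ZZZ still holds the current column name.

```r
library(data.table)
library(ggplot2)

dt <- data.table(values1 = rep(c(1, 2), each = 2),
                 values2 = rep(c(10, 20), each = 2),
                 notthis = 0,
                 category = rep(c('a', 'b'), each = 2))

# Columns to loop over, defined by exclusion as in the question
loopnames <- setdiff(colnames(dt), c('notthis', 'category'))

# Hypothetical output file and container, used only in this sketch
outfile <- file.path(tempdir(), "histograms.pdf")
means_by_col <- list()

pdf(outfile)
for (ZZZ in loopnames) {
    # get(ZZZ) fetches the column whose name is stored in ZZZ
    dtmeans <- dt[, .(means = mean(get(ZZZ))), by = category]
    means_by_col[[ZZZ]] <- dtmeans

    # print() is required for the plot to reach the device inside a loop
    print(
        ggplot(dt) +
            geom_histogram(aes(x = get(ZZZ), fill = category), bins = 30) +
            geom_vline(data = dtmeans,
                       aes(xintercept = means, color = category)) +
            xlab(ZZZ)
    )
}
dev.off()
```

If the plots are stored in a list and printed after the loop instead, get(ZZZ) would be evaluated with whatever value ZZZ has at that point, so printing inside the loop is the safer choice with this approach.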

dy_by