The OP wants to compute the aggregate for each variable separately but wants to combine the separate results in one data frame. In SQL
the latter step would correspond to a UNION
operation.
With data.table
, this can be achieved by rbindlist()
:
library(data.table)
setDT(test)
cols <- c("Var1", "Var2")
rbindlist(
lapply(cols, function(.col) test[, sum(x), by = .col]), idcol = TRUE
)[, setnames(.SD, c("var", "lvl", "sum_x"))][, var := factor(var, labels = cols)][]
# var lvl sum_x
#1: Var1 A 6
#2: Var1 B 5
#3: Var2 I 4
#4: Var2 II 3
#5: Var2 III 4
Explanation
- Store the variable names to loop over in
cols
.
- Loop over the variables using
lapply()
and for each variable compute the aggregates grouped by the levels within the variable. The result is a list of data.tables.
- The list is combined row-wise using
rbindlist
. The id column gives the number of the list element the rows are taken from.
- Rename the columns.
- Convert the number in the id column to the names of the variables.
Note, we could have named the list elements with the variable names by setNames()
before calling rbindlist()
but this would come with a perfomance penalty due to copying the result. Instead, this is done in the last step where only factor labels have to be assigned to the numbers in the id column.
Of course, there is an alternative approach (similar to this answer) which first reshapes the data from wide to long format and then does the aggregates grouped by variable and level.
melt(setDT(test), "x", value.name = "level")[, .(sum_x = sum(x)), by = .(variable, level)]
# variable level sum_x
#1: Var1 A 6
#2: Var1 B 5
#3: Var2 I 4
#4: Var2 II 3
#5: Var2 III 4
data.table
reports no issues with factor levels.
It would be interesting to see which of the two approaches is more efficient in terms of memory consumption and speed for large problems.