In R, simplifying long data.table commands (probably combining Data.table's "group by", lapply, and a vector of column names)

Question

I am repeatedly doing data.table commands that apply the same function to a bunch of columns, with a group.by statement, and I'm having trouble simplifying/generalizing the command. For example, "output" below is correct, but I think it should be possible to write a more concise command to achieve the same output. "output_fail" is my attempt to do so:

library(data.table)

df <- data.table(a = c(1, 1, 2, 2),
                 b = c(2, 4, 6, 6),
                 c = c(1, 3, 7, 10),
                 d = c(1, 5, 1, 5)
                 )

output = df[, .(b = sum(b, na.rm = TRUE),
                c = sum(c, na.rm = TRUE), 
                d = sum(d, na.rm = TRUE)
                ),
            by = a]


cols = c('b', 'c', 'd')

output_fail <- df[,(cols) := lapply(cols, function(x) sum(x, na.rm = TRUE))
                           , by = a
]

How could I write a line to produce output more cleanly given "cols"?

By using := you're modifying df in-place, that is, by reference. So you do not need to assign the result to output_fail. — Ric, Apr 01 '23 at 20:35

score 1 · Accepted Answer · answered Apr 01 '23 at 20:31

1

We could use mget to get the columns in cols and apply the sum function to them:

df[, lapply(mget(cols), sum, na.rm = TRUE), by = a]

   a  b  c d
1: 1  6  4 6
2: 2 12 17 6

answered Apr 01 '23 at 20:31

TarJae

72,363
6
19
66

1

I think I would favour using `.SDcols` and use `df[, lapply(.SD, sum, na.rm = TRUE), .SDcols = cols, by = a]` or if you want to summarize all columns you group by just do `df[, lapply(.SD, sum, na.rm = TRUE), by = a]` – Merijn van Tilborg Apr 03 '23 at 08:31

In R, simplifying long data.table commands (probably combining Data.table's "group by", lapply, and a vector of column names)

1 Answers1