1

I am repeatedly doing data.table commands that apply the same function to a bunch of columns, with a group.by statement, and I'm having trouble simplifying/generalizing the command. For example, "output" below is correct, but I think it should be possible to write a more concise command to achieve the same output. "output_fail" is my attempt to do so:

library(data.table)

df <- data.table(a = c(1, 1, 2, 2),
                 b = c(2, 4, 6, 6),
                 c = c(1, 3, 7, 10),
                 d = c(1, 5, 1, 5)
                 )

output = df[, .(b = sum(b, na.rm = TRUE),
                c = sum(c, na.rm = TRUE), 
                d = sum(d, na.rm = TRUE)
                ),
            by = a]


cols = c('b', 'c', 'd')

output_fail <- df[,(cols) := lapply(cols, function(x) sum(x, na.rm = TRUE))
                           , by = a
]

How could I write a line to produce output more cleanly given "cols"?

Sam Asin
  • 131
  • 9
  • By using := you're modifying df in-place, that is, by reference. So you do not need to assign the result to output_fail. – Ric Apr 01 '23 at 20:35

1 Answers1

1

We could use mget to get the columns in cols and apply the sum function to them:

df[, lapply(mget(cols), sum, na.rm = TRUE), by = a]

   a  b  c d
1: 1  6  4 6
2: 2 12 17 6
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • 1
    I think I would favour using `.SDcols` and use `df[, lapply(.SD, sum, na.rm = TRUE), .SDcols = cols, by = a]` or if you want to summarize all columns you group by just do `df[, lapply(.SD, sum, na.rm = TRUE), by = a]` – Merijn van Tilborg Apr 03 '23 at 08:31