18

I would like to use the data.table package in R to calculate column means for many columns by another set of columns. I know how to do this for a few columns, and I provide an example below. However, in my non-toy example, I have tens of variables I would like to do this for, and I would like to find a way to do this from a vector of the column names. Is this possible?

library(data.table)

# creates data table
dfo <- data.frame(bananas = 1:5, 
             melonas = 6:10,
             yeah = 11:15,
             its = c(1,1,1,2,2)
             )
dto <- data.table(dfo)

# gets column means by 'its' column
dto[,
.('bananas_mean' = mean(bananas),
  'melonas_mean' = mean(melonas),
  'yeah_mean' = mean(yeah)
  ),
by = .(its)]
BioBroo
  • 613
  • 1
  • 7
  • 21
  • 4
    This http://stackoverflow.com/q/16846380/ ? – Frank May 05 '17 at 14:34
  • 1
    Following up on Frank's comment, to get the naming the way you want you'd only need a slight modification: `cols <- c("bananas", "melonas", "yeah"); dto[,(paste0(cols, "_mean")) := lapply(.SD, mean), by = its, .SDcols = cols]` – Mike H. May 05 '17 at 14:54
  • I think there might be a problem with my `R`. I tried that exact code, and I get no output. Any idea why? – BioBroo May 05 '17 at 14:59
  • Well it's updating `dto` so have you tried looking at `dto` after you run it? – Mike H. May 05 '17 at 15:00
  • Oh that's very interesting. It works! I never saw `R` do this before. – BioBroo May 05 '17 at 15:03
  • @Frank How did you find that previously asked question? I tried to do my due diligence before asking this question, but I didn't find it. – BioBroo May 05 '17 at 15:03
  • @BioBroo I clicked through my profile :) It's my answer there. Btw, you see no output since := updates by reference and so doesn't create a new object. You can add `[]` to the end of a data.table command to print the result, though. – Frank May 05 '17 at 16:03
  • Is there a way to set these new columns as a new object, rather than binding them to the old object? I _could_ remove duplicates, but that seems pretty clunky. – BioBroo May 05 '17 at 21:13
  • This is stupidly simple in dplyr: `dfo %>% group_by(its) %>% summarise_all(funs(mean = mean))` – alistaire May 07 '17 at 18:23
  • @alistaire Indeed. Had I only been so enlightened at the time. – BioBroo Jan 18 '22 at 15:34

3 Answers3

17

Using data.table:

library(data.table)
d <- dto[, lapply(.SD, mean), by=its]

d

   its bananas melonas yeah
1:   1     2.0     7.0 12.0
2:   2     4.5     9.5 14.5

Obviously, other functions could be used and combined. Hope it helps.

COLO
  • 1,024
  • 16
  • 28
14

The OP has requested to calculate column means for many columns ... from a vector of the column names. In addition, the OP has demonstrated in his sample code that he wants to rename the resulting columns.

Both the excepted answer and the solution suggested in this comment do not fully meet all these requirements. The accepted answer computes means for all columns of the data.table and doesn't rename the results. The solution in the comments does use a vector of column names and renames the results but modifies the original data.table while the OP expects a new object.

The requirements of the OP can be met using the code below:

# define columns to compute mean of
cols <- c("bananas", "melonas")
# compute means for selected columns and rename the output
result <- dto[, lapply(.SD, mean), .SDcols = cols, by = its
              ][, setnames(.SD, cols, paste(cols, "mean", sep = "_"))]

result
#   its bananas_mean melonas_mean
#1:   1          2.0          7.0
#2:   2          4.5          9.5

Means are only computed for columns given as character vector of column names, the output columns have been renamed, and dto is unchanged.

Edit Thanks to this comment and this answer, there is a way to make data.table rename the output columns automagically:

result <- dto[, sapply(.SD, function(x) list(mean = mean(x))), .SDcols = cols, by = its]
result
#   its bananas.mean melonas.mean
#1:   1          2.0          7.0
#2:   2          4.5          9.5
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • 1
    Could also do `f <- function(x) list(mean = mean(x)) ; dto[, sapply(.SD, f), .SDcols = cols]` or such. Or for multiple function it could be generalized to `f <- function(x) c(mean = mean(x), max = max(x)) ; dto[, sapply(.SD, f), .SDcols = cols]` or similar or even `f <- function(x) list(mean = mean(x), max = max(x)) ; dto[, unlist(lapply(.SD, f)), .SDcols = cols]` – David Arenburg May 07 '17 at 17:26
  • It is taken from [here](http://stackoverflow.com/questions/29620783/data-table-in-r-apply-multiple-functions-to-multiple-columns/29621821#29621821) btw, which is probably a dupe target of this question – David Arenburg May 07 '17 at 17:39
  • @DavidArenburg Unfortunately, the generalized version with multiple functions doesn't play nicely when combined with grouping – Uwe May 07 '17 at 17:53
  • This is the best I could think of `f <- function(x, y) list(mean = mean(x), max = max(x)) ; dto[, setDT(do.call(rbind.data.frame, lapply(.SD, f)), keep.rownames = TRUE), .SDcols = cols, by = its]` – David Arenburg May 07 '17 at 18:43
0

Adding an option using colMeans

dto[, as.list(colMeans(.SD)), by=its]
   its bananas melonas yeah
1:   1     2.0     7.0 12.0
2:   2     4.5     9.5 14.5

Choosing the columns by name

dto[, as.list(colMeans(.SD[, c("bananas", "melonas")])), by=its]
   its bananas melonas
1:   1     2.0     7.0
2:   2     4.5     9.5

or by range

dto[, as.list(colMeans(.SD[, 2:3])), by=its]
   its melonas yeah
1:   1     7.0 12.0
2:   2     9.5 14.5
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29