Aggregate big dataset for a list of columns and using different FUN

Question

I have a big dataset, and I need to summarise most of the columns by one single factor (CODE_PLOT). This is the list of columns I need to aggregate:

> names(soil)[4:30]
 [1] "PH"            "CONDUCTIVITY"  "K"             "CA"            "MG"            "N_NO3"        
 [7] "S_SO4"         "ALKALINITY"    "AL"            "DOC"           "WATER_CONTENT" "Na"           
[13] "AL_LABILE"     "FE"            "MN"            "P"             "N_NH4"         "CL"           
[19] "CR"            "NI"            "ZN"            "CU"            "PB"            "CD"           
[25] "SI"            "SAMPLE_VOL"    "N_TOTAL"

For those columns I need mean, sd and length values. Since the dataset is big, performance is also important. I have tried aggregate, but didn’t work. I am open to other packages that can do it faster. My try:

soil_variables <- names(soil)[4:30]
soil_by <- "CODE_PLOT"
soilM <- aggregate(soil[soil_variables], by=soil[soil_by],data=soil,
                   FUN=function(x) c(mn =mean(x),n=length(x)),na.rm=T)

The required output is a data frame with 3 columns per variable: mean, sd an N (27x3 columns+ 1 “by" column)

By the way, your answerers have had to make up example data. It would be better if you provide a reproducible example in future questions http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Frank, Feb 25 '15 at 17:30

score 3 · Answer 1 · answered Feb 25 '15 at 17:28

3

library(data.table)
dt = data.table(a = 1:2, b = 1:10, c = 2:11)

dt[, as.list(unlist(lapply(.SD, function(x) c(mn = mean(x), sd = sd(x), n = .N))))
   , by = a]
#   a b.mn     b.sd b.n c.mn     c.sd c.n
#1: 1    5 3.162278   5    6 3.162278   5
#2: 2    6 3.162278   5    7 3.162278   5

You can add .SDcols to specify which columns you want this performed on.

answered Feb 25 '15 at 17:28

eddi

49,088
6
104
155

2

I'm planning to add an exhaustive set of `rowwise` and `colwise` functions to `.SD`. Will add a new issue. This should get plentiful straightforward when done. – Arun Feb 25 '15 at 17:54

talat · Accepted Answer · 2015-02-25T17:27:38.470

You can use dplyr's summarise_each function combined with group_by:

library(dplyr)
soil %>%
  group_by(CODE_PLOT) %>%
  summarise_each(funs(mean = mean(., na.rm = TRUE), 
                      sd = sd(., na.rm = TRUE), 
                      N = n()), 4:30)

This will summarise the columns 4:30 of your data.

If you want to supply a vector of column names to be summarised (as soil_variables in your example), you can do it like this:

soil_variables <- names(soil)[4:30]

soil %>%
  group_by(CODE_PLOT) %>%
  summarise_each_(funs(mean = mean(., na.rm = TRUE),      # note "summarise_each_"
                      sd = sd(., na.rm = TRUE), 
                      N = n()), soil_variables)

Here's an example with the "iris" data set summarising the two columns by group of "Species":

data(iris)
iris %>% 
  group_by(Species) %>% 
  summarise_each(
    funs(
      mean = mean(., na.rm = TRUE),
      sd = sd(., na.rm = TRUE),
      N = n()
      ), 1:2)

#Source: local data frame [3 x 7]
#
#     Species Sepal.Length_mean Sepal.Width_mean Sepal.Length_sd Sepal.Width_sd Sepal.Length_N Sepal.Width_N
#1     setosa             5.006            3.428       0.3524897      0.3790644             50            50
#2 versicolor             5.936            2.770       0.5161711      0.3137983             50            50
#3  virginica             6.588            2.974       0.6358796      0.3224966             50            50

Aggregate big dataset for a list of columns and using different FUN

2 Answers2