I have a big dataset, and I need to summarise most of the columns by a single factor (CODE_PLOT). This is the list of columns I need to aggregate:
> names(soil)[4:30]
[1] "PH" "CONDUCTIVITY" "K" "CA" "MG" "N_NO3"
[7] "S_SO4" "ALKALINITY" "AL" "DOC" "WATER_CONTENT" "Na"
[13] "AL_LABILE" "FE" "MN" "P" "N_NH4" "CL"
[19] "CR" "NI" "ZN" "CU" "PB" "CD"
[25] "SI" "SAMPLE_VOL" "N_TOTAL"
For those columns I need the mean, sd and length values. Since the dataset is big, performance is also important. I tried aggregate, but it didn't work. I am open to other packages that can do it faster. My attempt:
soil_variables <- names(soil)[4:30]
soil_by <- "CODE_PLOT"
soilM <- aggregate(soil[soil_variables], by=soil[soil_by],data=soil,
FUN=function(x) c(mn =mean(x),n=length(x)),na.rm=T)
The required output is a data frame with 3 columns per variable: mean, sd and N (27 x 3 columns + 1 "by" column).
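
To make the target shape concrete, here is a minimal base R sketch on a toy data frame. The toy values and the helper name mean_sd_n are made up for illustration; only the column names come from the real data. It also assumes that N should count non-missing values, which may not be what is wanted. It is one possible way to get the layout, not necessarily the fastest on a big dataset.

```r
# Toy stand-in for `soil` (hypothetical values; column names taken from the real data)
soil <- data.frame(
  CODE_PLOT = c(1, 1, 2, 2, 2),
  PH        = c(4.0, 5.0, 6.0, NA, 8.0),
  K         = c(1.0, 3.0, 2.0, 4.0, 6.0)
)
soil_variables <- c("PH", "K")
soil_by <- "CODE_PLOT"

# Hypothetical helper: mean, sd and count, computed on non-missing values only
# (assumption: N should exclude NAs)
mean_sd_n <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), n = length(x))
}

# aggregate() returns one matrix-valued column per variable when FUN returns a vector
agg <- aggregate(soil[soil_variables], by = soil[soil_by], FUN = mean_sd_n)

# Flatten the matrix columns into ordinary columns: PH.mean, PH.sd, PH.n, K.mean, ...
soilM <- do.call(data.frame, agg)
print(names(soilM))
```

With the real data, soil_variables would be names(soil)[4:30], giving the 27 x 3 + 1 columns described above.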