
I need to summarize some data and I'm using the function ddply from the plyr package. The dataset has 68 variables, and I'm trying to take the mean of every variable, grouped by two of them.

I'm trying to use the following code but it isn't working.

ddply(data, c("Var1", "Var2"), summarise, mean = mean(data$Var3 ~ data$Var68))

It shows me this message:

There were 50 or more warnings (use warnings() to see the first 50)

What is the problem with this code?

P.S.: Var1 is a factor with 6 levels, Var2 is an int. All other variables are num.

Rods2292
  • `dplyr` is the successor to `plyr`, and is well-suited here. While you should really add enough data to the question to make it reproducible, it would be something like `data %>% group_by(Var1, Var2) %>% summarise_all(mean)` – alistaire Jul 17 '16 at 21:59

3 Answers


data.table approach:

library(data.table)
setDT(data)

data[ , lapply(.SD, mean), by = .(Var1, Var2)]

Add bells and whistles to taste.

MichaelChirico
  • I don't understand what `.SD` is in the `lapply` call – Rods2292 Jul 17 '16 at 22:18
  • `.SD` is a `data.table`-specific temporary variable created within `[]`. Basically, it represents the whole dataset _within_ each `by` group. The mnemonic is **S**ubset of the **D**ata, since it can be manipulated with the `.SDcols` argument to represent a subset of columns; as it stands, it represents _all_ columns of `data`. – MichaelChirico Jul 17 '16 at 22:52
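To make the `.SD` / `.SDcols` distinction concrete, here is a minimal sketch on made-up data. The table `dt` and its values are hypothetical and only borrow the question's column names; this is not the asker's dataset.

```r
library(data.table)

# Hypothetical toy data; names mirror the question, values are invented.
dt <- data.table(
  Var1 = factor(c("a", "a", "b", "b")),
  Var2 = c(1L, 1L, 2L, 2L),
  Var3 = c(1, 3, 5, 7),
  Var4 = c(10, 20, 30, 40)
)

# Without .SDcols, .SD holds every non-grouping column of the current group.
all_means <- dt[, lapply(.SD, mean), by = .(Var1, Var2)]

# With .SDcols, .SD is restricted to the named columns (here only Var3).
some_means <- dt[, lapply(.SD, mean), by = .(Var1, Var2), .SDcols = "Var3"]

print(all_means)
print(some_means)
```

`all_means` has one mean column per non-grouping variable, while `some_means` keeps only the grouping columns and `Var3`.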

If you want a base R method, you can use aggregate. Here is a working example:

aggregate(. ~ g1 + g2, data=df, FUN=mean)
  g1 g2          a         b         c
1  1  0  0.3163713 0.4030635 0.4926396
2  2  0 -0.8909029 0.4211550 0.3286698
3  1  1 -0.5466319 0.9146582 0.2588098
4  2  1 -0.6130626 0.2997645 0.5848791

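This calculates the mean of each non-grouping variable within every combination of the two grouping variables. Because `.` on the left-hand side stands for all columns not named on the right, the same code works unchanged however many variables there are.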

data

set.seed(1234)
df <- data.frame(a=rnorm(10), b=runif(10),
                 g1=sample(1:2, 10, replace=TRUE), g2=rep(0:1, 5))
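One caveat worth knowing: the formula interface of `aggregate` uses `na.action = na.omit` by default, so any row with a missing value in any column is dropped before the means are taken. A rough sketch of the difference, using a small hypothetical data frame (`toy` is not the data above), compared with the data-frame interface where `na.rm = TRUE` is passed through to `mean`:

```r
# Hypothetical toy data with one missing value in column b.
toy <- data.frame(
  g1 = c(1, 1, 2, 2),
  g2 = c(0, 0, 1, 1),
  a  = c(1, 3, 5, 7),
  b  = c(10, 20, 30, NA)
)

# Formula interface: the row containing NA is removed entirely,
# so column a loses that observation too.
by_formula <- aggregate(cbind(a, b) ~ g1 + g2, data = toy, FUN = mean)

# Data-frame interface: rows are kept, and na.rm = TRUE is forwarded
# to mean(), so only the missing cell itself is ignored.
by_list <- aggregate(toy[c("a", "b")], by = toy[c("g1", "g2")],
                     FUN = mean, na.rm = TRUE)

print(by_formula)
print(by_list)
```

If the real data contain missing values, the two calls can give different means for the same group, so it is worth choosing deliberately.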
lmo

We can use dplyr

library(dplyr)
data %>%
     group_by(Var1, Var2) %>%
     summarise_each(funs(mean = mean(., na.rm = TRUE)))
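In dplyr 1.0.0 and later, `summarise_each()` and `funs()` are deprecated, and the same idea is usually written with `across()`. A minimal sketch on hypothetical data (the data frame `toy` and its values are invented; only the column names follow the question):

```r
library(dplyr)

# Hypothetical toy data; names mirror the question, values are invented.
toy <- data.frame(
  Var1 = factor(c("a", "a", "b", "b")),
  Var2 = c(1L, 1L, 2L, 2L),
  Var3 = c(1, NA, 5, 7),
  Var4 = c(10, 20, 30, 40)
)

# across(everything(), ...) applies the function to every non-grouping column.
means <- toy %>%
  group_by(Var1, Var2) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

print(means)
```

Here `everything()` skips the grouping columns automatically, and `.groups = "drop"` returns an ungrouped result.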
akrun