Selecting a range of variables in R

Question

I need to summarize some data and I'm using the function ddply from the plyr package. The dataset has 68 variables, and I'm trying to take the mean of every variable, grouped by two of them (Var1 and Var2).

I'm trying to use the following code but it isn't working.

ddply(data, c("Var1", "Var2"), summarise, mean = mean(data$Var3 ~ data$Var68))

It shows me this message:

There were 50 or more warnings (use warnings() to see the first 50)

What is the problem with this code?

P.S.: Var1 is a factor with 6 levels, Var2 is an int. All other variables are num.

`dplyr` is the successor to `plyr`, and is well-suited here. While you should really add enough data to the question to make it reproducible, it would be something like `data %>% group_by(Var1, Var2) %>% summarise_all(mean)` — alistaire, Jul 17 '16 at 21:59

score 4 · Answer 1 · answered Jul 17 '16 at 22:07

data.table approach:

library(data.table)
setDT(data)

data[ , lapply(.SD, mean), by = .(Var1, Var2)]

Add bells and whistles to taste.

MichaelChirico

I didn't understand what is `.SD` in the `lapply` function – Rods2292 Jul 17 '16 at 22:18

`.SD` is a `data.table`-specific temporary variable created within `[]`. Basically, it represents the whole dataset _within_ each `by` group. The mnemonic is **S**ubset of the **D**ata, since it can be manipulated with the `.SDcols` argument to represent a subset of columns; as it stands, it represents _all_ columns of `data` other than the grouping columns. – MichaelChirico Jul 17 '16 at 22:52
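
To make the `.SD` explanation concrete, here is a minimal sketch on made-up data. The table `dt` and its values are hypothetical; only the column names follow the question. It shows `.SD` covering every non-grouping column by default, and `.SDcols` narrowing it to a chosen column.

```r
library(data.table)

# Hypothetical toy data; the column names mirror the question, the values are invented.
dt <- data.table(
  Var1 = factor(c("a", "a", "b", "b")),
  Var2 = c(1L, 1L, 2L, 2L),
  Var3 = c(1, 3, 5, 7),
  Var4 = c(10, 20, 30, 40)
)

# Within each (Var1, Var2) group, .SD holds all columns except the grouping ones,
# so lapply() computes one mean per remaining column.
all_means <- dt[, lapply(.SD, mean), by = .(Var1, Var2)]

# .SDcols restricts .SD to the named columns only.
var3_only <- dt[, lapply(.SD, mean), by = .(Var1, Var2), .SDcols = "Var3"]

print(all_means)
print(var3_only)
```

With 68 columns the first call stays exactly the same, which is why no column range has to be spelled out.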

score 1 · Answer 2 · answered Jul 18 '16 at 00:55

If you want a base R method, you can use aggregate. Here is a working example:

aggregate(. ~ g1 + g2, data=df, FUN=mean)
  g1 g2          a         b         c
1  1  0  0.3163713 0.4030635 0.4926396
2  2  0 -0.8909029 0.4211550 0.3286698
3  1  1 -0.5466319 0.9146582 0.2588098
4  2  1 -0.6130626 0.2997645 0.5848791

This calculates the mean of three variables for each combination of two grouping variables. The same call works unchanged for the 66 non-grouping variables in the question (Var3 through Var68).

data

set.seed(1234)
df <- data.frame(a=rnorm(10), b=runif(10),
                 c=runif(10),  # assumed: c appears in the output above, but its definition is missing here
                 g1=sample(1:2, 10, replace=TRUE), g2=rep(0:1, 5))

score 1 · Accepted Answer · answered Jul 18 '16 at 02:30

We can use dplyr

library(dplyr)
data %>%
     group_by(Var1, Var2) %>%
     summarise_each(funs(mean = mean(., na.rm = TRUE)))
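
In newer dplyr releases (1.0.0 and later), `summarise_each()` and `funs()` are deprecated in favour of `across()`. A sketch of the equivalent call, assuming a recent dplyr and using a hypothetical toy data frame `df` in place of the asker's `data`:

```r
library(dplyr)

# Hypothetical toy data; the column names mirror the question, the values are invented.
df <- data.frame(
  Var1 = factor(c("a", "a", "b", "b")),
  Var2 = c(1L, 1L, 2L, 2L),
  Var3 = c(1, NA, 5, 7),
  Var4 = c(10, 20, 30, 40)
)

# across(everything(), ...) applies the function to every non-grouping column;
# na.rm = TRUE skips missing values, as in the answer above.
means <- df %>%
  group_by(Var1, Var2) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

print(means)
```

The result keeps the original column names; passing `.names = "{.col}_mean"` to `across()` should add a suffix if one is wanted.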

akrun
