4

I would like to compute summary variables from multiple columns in a data frame. This is possible when typing out all the row names, but I would like to use starts_with() and similar functions. I.e.

df <- data.frame(A1 = rnorm(100, 0, 1),
                 A2 = rnorm(100, 0, 1),
                 A3 = rnorm(100, 0, 1),
                 B1 = rnorm(100, 0, 1),
                 B2 = rnorm(100, 0, 1))

What works:

library(tidyverse)
df %>% mutate(A = (A1 + A2 + A3)/3)
df %>% mutate(A = rowMeans(select(., A1:A3)))

However, the former gets annoying when summarising many variables, while the latter gets incredibly slow very quickly when summarising many rows. I suspect there must be a faster solution.

What does not work:

df %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(starts_with("A")))

So my question is: Is there a way to use mean() etc. within mutate() to compute row means, ideally without having to spell out every single variable?

simoncolumbus
  • 526
  • 1
  • 8
  • 23

3 Answers3

5

Use in your select function the function starts_with:

df %>% mutate(A = rowMeans(select(., starts_with("A"))))

If you search the help for select_helpers you find several more.

ricoderks
  • 1,619
  • 9
  • 13
  • Thanks for this answer. This is essentially my second 'what works' example, except using starts_with(). Unfortunately, this gets very slow very quickly, so I was hoping there would be a faster solution. – simoncolumbus Feb 11 '19 at 10:39
  • 1
    Sorry, I missed the performance part! – ricoderks Feb 11 '19 at 11:13
4

Base R is the fastest in my tests.
I will use a bigger dataframe.

library(tidyverse)
library(microbenchmark)
library(ggplot2)

set.seed(1234)

n <- 1e4
df <- data.frame(A1 = rnorm(n, 0, 1),
                 A2 = rnorm(n, 0, 1),
                 A3 = rnorm(n, 0, 1),
                 B1 = rnorm(n, 0, 1),
                 B2 = rnorm(n, 0, 1))

mb <- microbenchmark(
  m1 = df %>% mutate(A = (A1 + A2 + A3)/3),
  m2 = df %>% mutate(A = rowMeans(select(., A1:A3))),
  m3 = df %>% mutate(A = reduce(.[, grepl("^A", names(.))], `+`) / ncol(.[, grepl("^A", names(.))])),
  m4 = rowMeans(df[, grep("^A", names(df))]),
  m5 = df[, grep("^A", names(df))] %>% rowMeans()
)

mb

autoplot(mb)

The purely base R way is faster, followed by the base R subsetting/dplyr pipe.

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Nice benchmark! Doesn't impact the performance that much, but should be noted that for the same output `%>% cbind(df, A = .)` (or something similar) would need to be added to `m4` and `m5`. – arg0naut91 Feb 11 '19 at 11:14
  • 1
    @arg0naut Or `%>% bind_cols(df, A = .)`, to use a `dplyr` function. – Rui Barradas Feb 11 '19 at 12:59
2

In my previous version I thought that rowMeans is the concern, but actually what is slowing down the calculation is the usage of select - better just stick with the grep family:

df %>% mutate(A = rowMeans(.[, grepl("^A", names(.))]))
arg0naut91
  • 14,574
  • 2
  • 17
  • 38