R tidy row means from subset of columns

Question

I would like to compute summary variables from multiple columns in a data frame. This is possible when typing out all the row names, but I would like to use starts_with() and similar functions. I.e.

df <- data.frame(A1 = rnorm(100, 0, 1),
                 A2 = rnorm(100, 0, 1),
                 A3 = rnorm(100, 0, 1),
                 B1 = rnorm(100, 0, 1),
                 B2 = rnorm(100, 0, 1))

What works:

library(tidyverse)
df %>% mutate(A = (A1 + A2 + A3)/3)
df %>% mutate(A = rowMeans(select(., A1:A3)))

However, the former gets annoying when summarising many variables, while the latter gets incredibly slow very quickly when summarising many rows. I suspect there must be a faster solution.

What does not work:

df %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(starts_with("A")))

So my question is: Is there a way to use mean() etc. within mutate() to compute row means, ideally without having to spell out every single variable?

score 5 · Answer 1 · answered Feb 11 '19 at 10:34

5

Use in your select function the function starts_with:

df %>% mutate(A = rowMeans(select(., starts_with("A"))))

If you search the help for select_helpers you find several more.

answered Feb 11 '19 at 10:34

ricoderks

1,619
9
13

Thanks for this answer. This is essentially my second 'what works' example, except using starts_with(). Unfortunately, this gets very slow very quickly, so I was hoping there would be a faster solution. – simoncolumbus Feb 11 '19 at 10:39
1

Sorry, I missed the performance part! – ricoderks Feb 11 '19 at 11:13

score 4 · Answer 2 · answered Feb 11 '19 at 10:52

Base R is the fastest in my tests.
I will use a bigger dataframe.

library(tidyverse)
library(microbenchmark)
library(ggplot2)

set.seed(1234)

n <- 1e4
df <- data.frame(A1 = rnorm(n, 0, 1),
                 A2 = rnorm(n, 0, 1),
                 A3 = rnorm(n, 0, 1),
                 B1 = rnorm(n, 0, 1),
                 B2 = rnorm(n, 0, 1))

mb <- microbenchmark(
  m1 = df %>% mutate(A = (A1 + A2 + A3)/3),
  m2 = df %>% mutate(A = rowMeans(select(., A1:A3))),
  m3 = df %>% mutate(A = reduce(.[, grepl("^A", names(.))], `+`) / ncol(.[, grepl("^A", names(.))])),
  m4 = rowMeans(df[, grep("^A", names(df))]),
  m5 = df[, grep("^A", names(df))] %>% rowMeans()
)

mb

autoplot(mb)

The purely base R way is faster, followed by the base R subsetting/dplyr pipe.

Nice benchmark! Doesn't impact the performance that much, but should be noted that for the same output `%>% cbind(df, A = .)` (or something similar) would need to be added to `m4` and `m5`. — arg0naut91, Feb 11 '19 at 11:14
@arg0naut Or `%>% bind_cols(df, A = .)`, to use a `dplyr` function. — Rui Barradas, Feb 11 '19 at 12:59

arg0naut91 · Accepted Answer · 2019-02-11T11:28:52.590

2

In my previous version I thought that rowMeans is the concern, but actually what is slowing down the calculation is the usage of select - better just stick with the grep family:

df %>% mutate(A = rowMeans(.[, grepl("^A", names(.))]))

edited Feb 11 '19 at 11:28

answered Feb 11 '19 at 10:48

arg0naut91

14,574
2
17
38

R tidy row means from subset of columns

3 Answers3