What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

Question

I am looking to generate a mean and variance value for every row in a numeric tibble. With my existing code, what I thought to be a very dplyr appropriate solution, it takes a number of hours to complete for 50,000 rows of about 35 columns.

Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious if there is something about dplyr I'm overlooking when performing a large series of calculations like this.

Reproducible example:

library(tidyverse)
library(vroom)
gen_tbl(50000, cols = 40, 
        col_types = paste0(rep("d", 40), collapse = "")) %>%
  rowwise() %>%
  mutate(mean = mean(c_across()),
         var = var(c_across()))

My suspicion lies with rowwise() but I am interested if there is a more nuanced way to solving this with dplyr or if it is just not a problem dplyr is good at.

Maybe work with a matrix and use `?rowMeans` and `rowSD` as per this suggestion https://stackoverflow.com/a/29581842/8583393 (sidenote: `rowMeans` will convert to a matrix anyhow) — markus, Jan 24 '22 at 22:41

Jon Spring · Accepted Answer · 2022-01-24T23:49:47.957

5

It seems like the processing time for the rowwise approach explodes quadratically:

Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.

df %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(-row) %>%
  group_by(row) %>%
  summarize(mean = mean(value),
            var = var(value)) %>%
  bind_cols(df, .)

edited Jan 24 '22 at 23:49

answered Jan 24 '22 at 22:43

Jon Spring

55,165
4
35
53

Wow what a steep curve! Definitely like this solution though, I had a suspicion the shape could be better. Thanks! – Taylor F Jan 25 '22 at 14:53

score 3 · Answer 2 · answered Jan 24 '22 at 23:07

If you can get a row-wise variance function like matrixStats::rowVars() you can do this. I'm not sure why c_across() doesn't do it but across() seems to work. I got it from this issue

library(tidyverse)
library(vroom)

gen_tbl(
  rows = 50000, 
  cols = 40, 
  col_types = paste0(rep("d", 40), collapse = "")
) %>%
  mutate(
    mean = rowMeans(across(everything())),
    # var = rowVars(across(everything())),
    .before = everything()
  )

# A tibble: 50,000 x 41
#    mean     X1     X2   
#    <dbl>  <dbl>  <dbl>  
# 1  0.199   -0.715 -1.64   
# 2 -0.212   -0.983 -1.38   
# 3  0.126   -0.135  0.263

What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

2 Answers2