11

Say I have the following data:

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)
df
#    colA colB colC colD
# 1 SampA   21   15   10
# 2 SampB   20   14   22
# 3 SampC   30   12   18

I want to get the row means and standard deviations for the values in columns B-D.

I can calculate the rowMeans as follows:

library(dplyr)
df %>% select(., matches("colB|colC|colD")) %>% mutate(rmeans = rowMeans(.))
#   colB colC colD   rmeans
# 1   21   15   10 15.33333
# 2   20   14   22 18.66667
# 3   30   12   18 20.00000

But when I try to calculate the standard deviation using sd(), it throws up an error.

df %>% select(., matches("colB|colC|colD")) %>% mutate(rsds = sapply(., sd(.)))
Error in is.data.frame(x) : 
  (list) object cannot be coerced to type 'double'

So my question is: how do I calculate the standard deviations here?

Edit: I tried sapply() with sd() having read the first answer here.

Additional edit: not necessarily looking for a 'tidy' solution (base R also works just fine).

Braiam
Dunois

7 Answers

6

I'm not sure how old/new dplyr's c_across functionality is relative to the prior answers on this page, but here's a solution that is almost directly cut and pasted from the documentation for dplyr::c_across:

df %>% 
  rowwise() %>% 
  mutate(
     mean = mean(c_across(colB:colD)),
     sd = sd(c_across(colB:colD))
  )

# A tibble: 3 x 6
# Rowwise: 
  colA   colB  colC  colD  mean    sd
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SampA    21    15    10  15.3  5.51
2 SampB    20    14    22  18.7  4.16
3 SampC    30    12    18  20    9.17
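
When the columns are not contiguous, or there are many of them, the same pattern should also work with a character vector of column names. The following is a minimal sketch, assuming dplyr 1.0.0 or later; `cols` is just an illustrative name, and the trailing `ungroup()` is there so later verbs are no longer applied row by row:

```r
library(dplyr)

# Data from the question
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

# Hypothetical helper vector naming the columns to summarise per row
cols <- c("colB", "colC", "colD")

res <- df %>%
  rowwise() %>%
  mutate(
    mean = mean(c_across(all_of(cols))),
    sd   = sd(c_across(all_of(cols)))
  ) %>%
  ungroup()  # drop the rowwise grouping again

res
```

Naming the columns explicitly (rather than selecting "everything numeric") avoids accidentally feeding the freshly created `mean` column into the `sd` calculation.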
D. Woods
  • 1
    This is definitely how I'd do it now. And, I guess `c_across` came out much later? This [post](https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/) by Hadley Wickham is from 2020. – Dunois Mar 03 '21 at 14:22
  • 1
    I appreciated this answer for a use case with many columns that I wanted to use in the rowwise calculation. Saved me from typing them all out. – Andrew Jackson Mar 05 '21 at 21:55
5

Try this, using rowSds from the matrixStats package:

library(dplyr)
library(matrixStats)

columns <- c('colB', 'colC', 'colD')

df %>% 
  mutate(Mean= rowMeans(.[columns]), stdev=rowSds(as.matrix(.[columns])))

Returns

   colA colB colC colD     Mean    stdev
1 SampA   21   15   10 15.33333 5.507571
2 SampB   20   14   22 18.66667 4.163332
3 SampC   30   12   18 20.00000 9.165151

Your data

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)
df
Hector Haffenden
4

A different dplyr and tidyr approach could be:

df %>% 
 pivot_longer(-1) %>%
 group_by(colA) %>%
 mutate(rsds = sd(value)) %>%
 pivot_wider(names_from = "name",
             values_from = "value")

  colA   rsds  colB  colC  colD
  <fct> <dbl> <dbl> <dbl> <dbl>
1 SampA  5.51    21    15    10
2 SampB  4.16    20    14    22
3 SampC  9.17    30    12    18

Or alternatively, using rowwise() and do():

 df %>% 
 rowwise() %>%
 do(data.frame(., rsds = sd(unlist(.[2:length(.)]))))

  colA   colB  colC  colD  rsds
* <fct> <dbl> <dbl> <dbl> <dbl>
1 SampA    21    15    10  5.51
2 SampB    20    14    22  4.16
3 SampC    30    12    18  9.17

Or an option since dplyr 1.0.0:

df %>% 
 rowwise() %>%
 mutate(rsds = sd(c_across(-1)))
tmfmnk
3

Here is another way using pmap to get the rowwise mean and sd

library(purrr)
library(dplyr)
library(tidyr)
f1 <- function(x) tibble(Mean = mean(x), SD = sd(x))
df %>% 
  # select the numeric columns
  select_if(is.numeric) %>%
  # apply the f1 rowwise to get the mean and sd in transmute
  transmute(out = pmap(.,  ~ f1(c(...)))) %>% 
  # unnest the list column
  unnest(cols = out) %>%
  # bind with the original dataset
  bind_cols(df, .)
#   colA colB colC colD     Mean       SD
#1 SampA   21   15   10 15.33333 5.507571
#2 SampB   20   14   22 18.66667 4.163332
#3 SampC   30   12   18 20.00000 9.165151
akrun
  • I'm sure this has probably been asked somewhere (and I can't seem to get an answer from a quick Google search), but what is the significance of `c(...)`? – Dunois Mar 24 '19 at 21:42
  • 1
    @Dunois We are capturing all the row elements with `...` and concatenating (`c`) into a vector – akrun Mar 25 '19 at 03:54
3

You can use pmap, or rowwise (or group_by on colA), along with mutate:

library(tidyverse)
df %>% mutate(sd = pmap(.[-1], ~sd(c(...)))) # same as transform(df, sd = apply(df[-1],1,sd))
#>    colA colB colC colD       sd
#> 1 SampA   21   15   10 5.507571
#> 2 SampB   20   14   22 4.163332
#> 3 SampC   30   12   18 9.165151

df %>% rowwise() %>% mutate(sd = sd(c(colB,colC,colD)))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 5
#>   colA   colB  colC  colD    sd
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 SampA    21    15    10  5.51
#> 2 SampB    20    14    22  4.16
#> 3 SampC    30    12    18  9.17

df %>% group_by(colA) %>% mutate(sd = sd(c(colB,colC,colD)))
#> # A tibble: 3 x 5
#> # Groups:   colA [3]
#>   colA   colB  colC  colD    sd
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 SampA    21    15    10  5.51
#> 2 SampB    20    14    22  4.16
#> 3 SampC    30    12    18  9.17
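
One detail that may matter downstream: `pmap()` returns a list, so the `sd` column in the first example is a list column even though it prints like numbers. If a plain numeric column is wanted, `pmap_dbl()` is the typed variant. A minimal sketch, assuming dplyr and purrr are installed (`res` is an illustrative name):

```r
library(dplyr)
library(purrr)

# Data from the question
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

# pmap_dbl() calls the function once per row of .[-1] and
# returns a double vector instead of a list
res <- df %>% mutate(sd = pmap_dbl(.[-1], ~ sd(c(...))))

is.numeric(res$sd)
```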
moodymudskipper
  • I have noticed that for `dplyr` 0.8.3 and `tidyverse` 1.2.1 none of these solutions works with `select`, e.g with `sd(select(.,-colA))` irrespective of using `group_by` or `rowwise`. Any thoughts on that? – Fourier Dec 17 '20 at 07:50
  • I'm not sure what you tried but you might have misunderstood the dot insertion rules of magrittr. `df %>% sd(select(.,-colA))` is equivalent to `df %>% sd(., select(.,-colA))` – moodymudskipper Dec 17 '20 at 09:33
  • So which would be the correct syntax in the case of `select`ing columns (with or without `rowwise()`) prior to the operation? – Fourier Dec 17 '20 at 09:44
  • 1
    Something like `df %>% select(-colA) %>% mutate(sd = pmap(., ~sd(c(...))))` ? – moodymudskipper Dec 17 '20 at 10:20
  • 1
    Yes, thank you. `pmap` it has to be here. That works like a charm! – Fourier Dec 17 '20 at 10:38
3

I see this post is a bit old, but there are some pretty complicated answers so I thought I'd suggest an easier (and faster) approach.

Calculating means of rows is trivial, just use rowMeans:

rowMeans(df[, c('colB', 'colC', 'colD')])

This is vectorised and very fast.

Base R has no 'rowSd' function (the matrixStats package provides rowSds), but it is not hard to write one. Here is the 'rowVars' I use.

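rowVars <- function(x, na.rm = FALSE) {
    # Vectorised row-wise sample variance
    # n = number of values per row that actually enter the sum
    n <- if (na.rm) rowSums(!is.na(x)) else ncol(x)
    rowSums((x - rowMeans(x, na.rm = na.rm))^2, na.rm = na.rm) / (n - 1)
}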

To calculate sd:

sqrt(rowVars(df[, c('colB', 'colC', 'colD')]))

Again, vectorised and fast which can be important if the input matrix is large.
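
As a quick sanity check, the vectorised result can be compared against calling `sd()` once per row with `apply()`. This is only a sketch on the question's data; it uses a stripped-down `rowVars` without `na.rm` handling, and `fast_sd` / `ref_sd` are illustrative names:

```r
# Data from the question
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

# Row-wise sample variance (simplified: assumes no missing values)
rowVars <- function(x) {
  rowSums((x - rowMeans(x))^2) / (ncol(x) - 1)
}

cols <- c("colB", "colC", "colD")
fast_sd <- sqrt(rowVars(df[, cols]))

# Reference: apply() runs sd() on each row (simple, but slower on large inputs)
ref_sd <- apply(df[, cols], 1, sd)

all.equal(unname(fast_sd), unname(ref_sd))
```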

randr
2

Package magrittr pipes %>% are not a good way to process by rows.
Maybe the following is what you want.

df %>% 
  select(-colA) %>%
  t() %>% as.data.frame() %>%
  summarise_all(sd)
#        V1       V2       V3
#1 5.507571 4.163332 9.165151
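
Since base R is also acceptable to the asker, here is a sketch of the same calculation without pipes or transposing. For context, the original attempt `sapply(., sd(.))` fails because `sd(.)` is evaluated on the whole data frame before `sapply()` ever runs, and `sapply()` on a data frame iterates over columns rather than rows in any case. `apply()` with `MARGIN = 1` goes over rows instead (`num_cols` is an illustrative name):

```r
# Data from the question
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

num_cols <- c("colB", "colC", "colD")

# rowMeans() is vectorised; apply(..., 1, sd) calls sd() once per row
df$rmeans <- rowMeans(df[, num_cols])
df$rsds   <- unname(apply(df[, num_cols], 1, sd))

df
```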
Rui Barradas
  • Thank you for pointing that out. I am never sure when to attempt the `tidyverse` approach and when to stick to base R. I should have probably mentioned in the OP that I wasn't necessarily looking for a piped solution? – Dunois Mar 24 '19 at 18:49
  • 2
    @Dunois Maybe yes, but the question is tagged `tidyverse` and pipes are a really nice way to process data. I mentioned it mostly because I tried `rowwise()` and couldn't get it to work and so resorted to `t() %>% as.data.frame()`. – Rui Barradas Mar 24 '19 at 19:02
  • 2
    Here's a way to make `rowwise` work : `df %>% rowwise() %>% summarize(sd = sd(c(colB,colC,colD)))` – moodymudskipper Mar 25 '19 at 14:34
  • @Moody_Mudskipper You should post it as an answer. – Rui Barradas Mar 25 '19 at 15:10