28

I am using in my code colSums but I also need the standard deviation beside the sum. I searched in the internet and found this page which contain only:

colSums
colMeans

http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html

I tried this:

colSd

but I got this error:

Error: could not find function "colSd"

How I can do the same thing but for standard deviation:

colSd

Here is the code:

results <- colSums(x,na.rm=TRUE)#### here I want colsd
Roman
  • 17,008
  • 3
  • 36
  • 49
sacvf
  • 2,463
  • 5
  • 36
  • 54

8 Answers8

27

I want to provide a fourth (very similar to @Thomas) approach and some benchmarking:

library("microbenchmark")
library("matrixStats")

colSdApply <- function(x, ...)apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds

colSdColMeans <- function(x, na.rm=TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks @flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
  return(sqrt(colVar * n/(n-1)))
}

colSdThomas <- function(x)sqrt(rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1)))

m <- matrix(runif(1e7), nrow=1e3)

microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))

# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#        colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783   100
#  colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715   100
#     colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407   100
#       colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806   100


all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE
sgibb
  • 25,396
  • 3
  • 68
  • 74
  • 1
    @sacvf: I missed the `...` in `colMeans` (see my edit). Now `colSdColMeans(x, na.rm=TRUE)` should work. – sgibb Jul 09 '13 at 15:22
  • 2
    I think to deal with `NA` you will have to use something like `n <- colSums(!is.na(x))`. – flodel Jul 09 '13 at 20:08
  • +1 for the benchmarking of different methods. This was useful and new for me. – asb Jul 09 '13 at 21:10
  • @flodel thanks, you are right, I forgot that! (I update my code and the benchmark results.) – sgibb Jul 09 '13 at 22:14
  • 1
    In the meantime the benchmark seems to be outdated and `matrixStats::colSds` 4-10x faster than the rest. – jay.sf Apr 29 '19 at 08:40
7

colSds and rowSds are two of many similar functions in the matrixStats package

Henry
  • 6,704
  • 2
  • 23
  • 39
6

This is the quickest and shortest way to calculate the standard deviation of the columns:

sqrt(diag(cov(data_matrix)))

Since the diagonal of a co-variance matrix consists of the variances of each variable, we do the following:

  • Calculate the co-variance matrix using cov
  • Extract the diagonal of the matrix using diag
  • Take the square root of the diagonal values using sqrt in order to get the standard deviation

I hope that helps :)

5

Use the following:

colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)
asb
  • 4,392
  • 1
  • 20
  • 30
  • I would do `function(x, ...) apply(X = x, ...))`. – Roman Luštrik Jul 09 '13 at 13:45
  • I debated that myself, but then sd uses only one optional, so I thought of including that directly. – asb Jul 09 '13 at 13:46
  • 7
    @sacvf, it doesn't matter how many comments you pour in with "my results are NA, any idea why?", we can't help unless we see the data you're having a problem with. You should direct some of your energy into making your question reproducible. – Arun Jul 09 '13 at 14:21
  • I also tried the solution given by `@Henry` but I got the smae:Just `NA` – sacvf Jul 09 '13 at 14:28
  • @savcf: I had disappeared for some time. You can't seriously ask something like 'what you meant if check na.rm'. You should really produce a reproducible example to get a correct solution. – asb Jul 09 '13 at 21:08
4

I don't know if these are particularly fast, but why not just use the formulae for SD:

x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))

# If you have a population:
colsdpop <- function(x,...)
     sqrt(rowMeans((t(x)-colMeans(x,...))^2,...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`

# If you have a sample:
colsdsamp <- function(x)
    sqrt( (rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1))) )
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`

Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.

Thomas
  • 43,637
  • 12
  • 109
  • 140
4

I believe I have found a more elegant solution in diag(sqrt(var(data)))

This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.

EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.

jmajic
  • 49
  • 1
3

I usually do column sd's with apply:

x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))

> apply(x, 2, sd)
        y         z 
0.8022729 3.4700314 

Verify:

> sd(x$y)
[1] 0.8022729

> sd(x$z)
[1] 3.470031

You can also do it with dplyr easily:

library(dplyr)
library(magrittr) # for pipes

> x %>% summarize_all(.,sd)
          y        z
1 0.8022729 3.470031
mysteRious
  • 4,102
  • 2
  • 16
  • 36
1

You can just use apply function

all.sd <- apply(data, 2,sd)