Is there such "colsd" in R?

Question

I am using in my code colSums but I also need the standard deviation beside the sum. I searched in the internet and found this page which contain only:

colSums
colMeans

http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html

I tried this:

colSd

but I got this error:

Error: could not find function "colSd"

How I can do the same thing but for standard deviation:

colSd

Here is the code:

results <- colSums(x,na.rm=TRUE)#### here I want colsd

No, but if you have a data.frame, try `sapply(x, sd)` or more general, `apply(x, 2, sd)`. — Roman Luštrik, Jul 09 '13 at 13:43
https://stat.ethz.ch/pipermail/r-help/2002-March/019606.html — David Heffernan, Jul 09 '13 at 13:44
I also like the `numcolwise` function from the `plyr` package for this type of thing. — Justin, Jul 09 '13 at 13:44
Yes, it'd be nice to have such functions. `colMeans` and `colSums` are much faster than `apply(X, 2, ...)` counterparts... — Arun, Jul 09 '13 at 13:48

sgibb · Accepted Answer · 2013-07-09T22:17:46.637

27

I want to provide a fourth (very similar to @Thomas) approach and some benchmarking:

library("microbenchmark")
library("matrixStats")

colSdApply <- function(x, ...)apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds

colSdColMeans <- function(x, na.rm=TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks @flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
  return(sqrt(colVar * n/(n-1)))
}

colSdThomas <- function(x)sqrt(rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1)))

m <- matrix(runif(1e7), nrow=1e3)

microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))

# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#        colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783   100
#  colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715   100
#     colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407   100
#       colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806   100


all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE

edited Jul 09 '13 at 22:17

answered Jul 09 '13 at 15:00

sgibb

25,396
3
68
74

1

@sacvf: I missed the `...` in `colMeans` (see my edit). Now `colSdColMeans(x, na.rm=TRUE)` should work. – sgibb Jul 09 '13 at 15:22
2

I think to deal with `NA` you will have to use something like `n <- colSums(!is.na(x))`. – flodel Jul 09 '13 at 20:08
+1 for the benchmarking of different methods. This was useful and new for me. – asb Jul 09 '13 at 21:10
@flodel thanks, you are right, I forgot that! (I update my code and the benchmark results.) – sgibb Jul 09 '13 at 22:14
1

In the meantime the benchmark seems to be outdated and `matrixStats::colSds` 4-10x faster than the rest. – jay.sf Apr 29 '19 at 08:40

score 7 · Answer 2 · answered Jul 09 '13 at 13:57

7

colSds and rowSds are two of many similar functions in the matrixStats package

answered Jul 09 '13 at 13:57

Henry

6,704
2
23
39

5

Those functions in that package aren't actually doing anything fancy; they are just as slow as `apply(x, 2, sd)`. – nograpes Jul 09 '13 at 14:07
3

Not anymore. They are all written in C/C++ these days – David Arenburg May 10 '18 at 06:41

score 6 · Answer 3 · answered Jan 23 '18 at 05:38

This is the quickest and shortest way to calculate the standard deviation of the columns:

sqrt(diag(cov(data_matrix)))

Since the diagonal of a co-variance matrix consists of the variances of each variable, we do the following:

Calculate the co-variance matrix using cov
Extract the diagonal of the matrix using diag
Take the square root of the diagonal values using sqrt in order to get the standard deviation

I hope that helps :)

score 5 · Answer 4 · answered Jul 09 '13 at 13:44

5

Use the following:

colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)

answered Jul 09 '13 at 13:44

asb

4,392
1
20
30

I would do `function(x, ...) apply(X = x, ...))`. – Roman Luštrik Jul 09 '13 at 13:45
I debated that myself, but then sd uses only one optional, so I thought of including that directly. – asb Jul 09 '13 at 13:46
7

@sacvf, it doesn't matter how many comments you pour in with "my results are NA, any idea why?", we can't help unless we see the data you're having a problem with. You should direct some of your energy into making your question reproducible. – Arun Jul 09 '13 at 14:21
I also tried the solution given by `@Henry` but I got the smae:Just `NA` – sacvf Jul 09 '13 at 14:28
@savcf: I had disappeared for some time. You can't seriously ask something like 'what you meant if check na.rm'. You should really produce a reproducible example to get a correct solution. – asb Jul 09 '13 at 21:08

Thomas · Answer 5 · 2013-07-09T15:33:05.630

I don't know if these are particularly fast, but why not just use the formulae for SD:

x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))

# If you have a population:
colsdpop <- function(x,...)
     sqrt(rowMeans((t(x)-colMeans(x,...))^2,...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`

# If you have a sample:
colsdsamp <- function(x)
    sqrt( (rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1))) )
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`

Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.

score 4 · Answer 6 · answered Apr 13 '14 at 23:34

I believe I have found a more elegant solution in diag(sqrt(var(data)))

This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.

EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.

mysteRious · Answer 7 · 2019-10-10T04:05:12.517

3

I usually do column sd's with apply:

x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))

> apply(x, 2, sd)
        y         z 
0.8022729 3.4700314

Verify:

> sd(x$y)
[1] 0.8022729

> sd(x$z)
[1] 3.470031

You can also do it with dplyr easily:

library(dplyr)
library(magrittr) # for pipes

> x %>% summarize_all(.,sd)
          y        z
1 0.8022729 3.470031

edited Oct 10 '19 at 04:05

answered Oct 10 '19 at 03:59

mysteRious

4,102
2
16
36

score 1 · Answer 8 · answered Nov 03 '19 at 15:44

1

You can just use apply function

all.sd <- apply(data, 2,sd)

answered Nov 03 '19 at 15:44

Julien Hubar

11
1

Is there such "colsd" in R?

8 Answers8

Linked

Related