
I'm having trouble primarily with the standard deviations, and possibly with finding a better solution for the means as well.

dat <- data.frame(matrix(rnorm(16*100), ncol=100)) # data

In this example I have a dataset of 100 columns, and I need to get the mean and standard deviation of each row in groups of 25 columns (samples).

I first found code that would do this for a single group of columns:

as.data.frame(rowMeans(dat[,1:25]))     # mean of columns 1:25
as.data.frame(apply(dat[,1:25],1,mean)) # mean of columns 1:25
as.data.frame(apply(dat[,1:25],1,sd))   # sd of columns 1:25

Initially I used rowMeans and made this work via the loop below:

dat.means <- list() # create empty list for means
# mean of every 25 cols
count <- 1
for(i in seq(1,length(dat),25)){
  dat.means[[count]] <- cbind(rowMeans(as.data.frame(dat[,i:i+24])))
  count=count+1
}

At this point I couldn't find an equivalent of rowMeans for calculating the standard deviation, so I backtracked to trying to use apply instead. My knowledge of how to use it this way is pretty lacking, however, and I've only gotten errors so far.

for(i in seq(1,length(dat),25)){
  dat.means[[count]] <- cbind(apply(dat[,i:i+24],1,mean))
  count=count+1
}

#Error in apply(dat[, i:i + 24], 1, mean) : 
# dim(X) must have a positive length

I've tried a few other iterations of the above loop but I'm still getting the posted error.

I also have a feeling that the loop might not be the best method, but am at a loss. Appreciate any help.

In response to the suspected duplicate: calculating sd and mean in a dataframe with NA values is not the question here; the question is how to apply the functions effectively to groups of columns in a larger dataframe.

tastycanofmalk
    Possible duplicate of [means and SD for columns in a dataframe with NA values](http://stackoverflow.com/questions/20794284/means-and-sd-for-columns-in-a-dataframe-with-na-values) – Caleb Kleveter Feb 10 '17 at 18:42

2 Answers


Using the data.table package:

# load 'data.table'
library(data.table)

# melt into long format and add 'row.id' variable with number of each row
dat2 <- melt(setDT(dat)[, row.id := .I], id = 'row.id')

# create a grouping variable for each block of 25 values
dat2[, grp := rep(1:4, each = 25), by = row.id]

# summarise
dat2[, .(mn = mean(value), std = sd(value) ), by = .(row.id,grp)]

which gives:

    row.id grp          mn       std
 1:      1   1 -0.30388554 1.0307631
 2:      2   1  0.04381967 0.7939788
 3:      3   1  0.03106169 0.8581719
 4:      4   1 -0.15215035 0.8200987

....

15:     15   1 -0.23641918 0.7024393
16:     16   1  0.09745967 1.0253811
17:      1   2 -0.16414997 0.8695713
18:      2   2 -0.06763887 1.0294245

....

31:     15   2  0.06034238 0.7756055
32:     16   2  0.16387033 0.9285894
33:      1   3  0.32860736 1.0802055
34:      2   3  0.51183174 0.9562819

....

47:     15   3  0.16075275 1.0335789
48:     16   3 -0.43298467 1.1010562
49:      1   4  0.24918962 0.9580600
50:      2   4 -0.13005426 1.1693455

....

62:     14   4  0.02436604 0.7341284
63:     15   4 -0.19614383 0.7039496
64:     16   4  0.01182338 0.8465747

How this works:

  • With setDT(dat) the dataframe is converted to a data.table (an enhanced form of a data.frame).
  • [, row.id := .I] adds a variable with the row number of each row.
  • melt is then used to transform the data into long format, with the row number as the identifier.
  • Next, for each row.id a grouping variable is created with rep(1:4, each = 25), which creates a vector of 25 1's, then 25 2's, and so on. So, for example, the first 25 values for row.id == 1 (which correspond to the first 25 columns of the original dat dataframe) get group id 1, the second 25 values get group id 2, and so on (see the short illustration below).
  • Next you summarise with dat2[, .(mn = mean(value), std = sd(value) ), by = .(row.id,grp)], where you use row.id and grp as grouping variables.

The result is a mean and a standard deviation for each group of columns for each row.
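
To make the grouping step concrete, here is a small sketch of the vector that rep(1:4, each = 25) produces (grp_vec is just an illustrative name; the 100 columns are assumed to split into exactly four blocks of 25):

# illustration of the grouping vector used in the answer above
grp_vec <- rep(1:4, each = 25)
length(grp_vec) # 100: one entry per melted value within each row.id
table(grp_vec)  # 25 values fall into each of the 4 groups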


Another option is to use a combination of dcast and melt, taking advantage of the possibility to specify multiple aggregate functions in dcast:

dcast(melt(setDT(dat)[, row.id := .I], id = 'row.id')[, grp := rep(1:4, each = 25), by = row.id],
      row.id ~ grp, fun.aggregate = list(mean, sd))

which gives:

    row.id value_mean_1 value_mean_2 value_mean_3 value_mean_4 value_sd_1 value_sd_2 value_sd_3 value_sd_4
 1:      1  -0.30388554  -0.16414997   0.32860736   0.24918962  1.0307631  0.8695713  1.0802055  0.9580600
 2:      2   0.04381967  -0.06763887   0.51183174  -0.13005426  0.7939788  1.0294245  0.9562819  1.1693455
 3:      3   0.03106169  -0.07250312   0.21619928   0.13092043  0.8581719  1.1439506  0.9441762  1.0006230
 4:      4  -0.15215035  -0.08417522  -0.27278714  -0.04190002  0.8200987  0.9008114  1.0394255  1.2063465
 5:      5   0.21871123   0.08029101  -0.04965507  -0.15279897  0.9593703  0.8409534  0.8878550  1.0157824
 6:      6   0.22335221   0.27142844   0.14032413   0.09975956  1.1154142  1.0896226  0.8587636  1.1147968
 7:      7   0.16725794  -0.03462013   0.14675249  -0.15678569  0.9991910  0.9236954  1.1258560  1.0250408
 8:      8  -0.12872236   0.03884649  -0.48565736  -0.30525278  1.0118579  1.0266040  1.1284902  0.9048042
 9:      9   0.25986114   0.25181718   0.07673463  -0.11521187  1.0509685  0.8352278  1.0952720  1.0706587
10:     10  -0.32670802  -0.04590547   0.22610217   0.09406650  1.0674699  0.8378048  0.8128130  0.9126611
11:     11  -0.16219092  -0.24172025  -0.14231462   0.03671087  1.1617784  1.0522955  0.8899262  0.8982543
12:     12   0.21109682   0.19735885  -0.03901236  -0.19283362  0.9064956  0.9530479  1.0422911  0.8323033
13:     13   0.11926882   0.29611127  -0.37648849  -0.08673776  1.0739078  0.7220276  0.9455307  0.9623676
14:     14   0.26478861   0.16054927  -0.03315950   0.02436604  1.0555501  1.0713119  0.9112082  0.7341284
15:     15  -0.23641918   0.06034238   0.16075275  -0.19614383  0.7024393  0.7756055  1.0335789  0.7039496
16:     16   0.09745967   0.16387033  -0.43298467   0.01182338  1.0253811  0.9285894  1.1010562  0.8465747

With dplyr/tidyr:

library(dplyr)
library(tidyr)
dat %>% 
  mutate(id = row_number()) %>% 
  gather(k, v, 1:100) %>% 
  group_by(id) %>% 
  mutate(grp = rep(1:4, each = 25)) %>% 
  group_by(id, grp) %>% 
  summarise(mn = mean(v), std = sd(v))
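
As a side note, gather has since been superseded by pivot_longer in newer versions of tidyr; a rough sketch of the same pipeline with that interface (the column names k and v and the .groups argument are illustrative choices, not part of the original answer):

library(dplyr)
library(tidyr)
dat %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = -id, names_to = "k", values_to = "v") %>%
  group_by(id) %>%
  mutate(grp = rep(1:4, each = 25)) %>%
  group_by(id, grp) %>%
  summarise(mn = mean(v), std = sd(v), .groups = "drop")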

Or with base R:

dat2 <- reshape(data = dat, ids = rownames(dat), direction = 'long', varying = list(names(dat)), times = names(dat))
dat2 <- transform(dat2, grp = ave(id, id, FUN = function(i) rep(1:4, each = 25)))
aggregate(X1 ~ id + grp, dat2, FUN = function(x) c(std = sd(x), mn = mean(x)))
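
The aggregate call returns both statistics packed into a single matrix column named X1; if separate columns are wanted, a common idiom is do.call(data.frame, ...) (a sketch; res is just an illustrative name, reusing dat2 from the snippet above):

res <- aggregate(X1 ~ id + grp, dat2,
                 FUN = function(x) c(std = sd(x), mn = mean(x)))
do.call(data.frame, res) # splits the matrix column into X1.std and X1.mn
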
Jaap
  • I'm not sure if this is right, i'm trying to get mean/sd per each 25-columns, which should result in a total of 4-columns. I'm still going through the code though. – tastycanofmalk Feb 10 '17 at 19:25
  • @user3564760 I've updated the answer with an explanation – Jaap Feb 10 '17 at 20:04

In base R you can use tapply inside apply, with an index vector of the same length as each row.

t(apply(dat, 1, function(row){
  tapply(row, INDEX=rep(1:4,c(25,25,25,25)), mean) # or sd
 })
)

So here we run apply over your dataset row by row. Each row is passed to tapply, where an index vector (the same length as the row in this case) assigns each element to a group numbered 1, 2, 3, etc., and the chosen function is then applied to each group. (A parameterised version of the index is sketched after the comments below.)

Output:

                 1           2           3            4
[1,] -0.121142260  0.09109255  0.14969065 -0.008491494
[2,]  0.100938120  0.05852706  0.01694019  0.142837311
[3,] -0.270040421 -0.13509216 -0.02526398  0.176398683
[4,] -0.098860947 -0.02428447  0.34782123 -0.113218821
[5,]  0.058705197  0.25760489  0.30359424  0.457067044
[6,] -0.004329987  0.16322551 -0.20793649 -0.100291690
[7,]  0.146482094  0.08483679  0.16754837 -0.027107295
[8,]  0.013796914 -0.09084366  0.23347784 -0.194043232
[9,] -0.292440563  0.03362355  0.03668636 -0.113120322
[10,] -0.083525957 -0.04704885  0.21239136  0.378796710
[11,]  0.355684510 -0.34531764 -0.17021181 -0.293445102
[12,]  0.165324616 -0.32272002 -0.28986401 -0.135609262
[13,]  0.134330325 -0.04966847  0.22928705  0.012515783
[14,] -0.117367280  0.14220143  0.03655234 -0.175041681
[15,]  0.313223877  0.29656269 -0.14042955 -0.173458094
[16,]  0.062781966  0.09551260 -0.05704605  0.048142911
TJGorrie
  • This looks pretty good. Any idea how to automate the index? E.g. with variables as `replicates = 25; uniques = 4`. I just tried something like `tapply(row, INDEX=rep(1:uniques,rep(replicates, uniques), mean))` with no success. – tastycanofmalk Feb 10 '17 at 19:47
  • hmm okay i think i must have messed the code, up above. I tried the code in my comment again and it worked. thanks! – tastycanofmalk Feb 10 '17 at 19:54
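
Following up on the comment above about automating the index, a small sketch with the block size and group count pulled out into variables (replicates and uniques are illustrative names, and ncol(dat) is assumed to equal uniques * replicates):

replicates <- 25
uniques    <- 4
t(apply(dat, 1, function(row) {
  tapply(row, INDEX = rep(1:uniques, each = replicates), sd) # or mean
}))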