17

How can I use apply or a related function to create a new data frame that contains the row averages of each group of columns in a very large data frame?

I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are vectors of the same length). I'd like to calculate the average (and other stats) across the replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.

For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample): one that is the row-wise average of the replicates in dat$a, dat$b and dat$c, and one that is the row-wise average of dat$d, dat$e and dat$f?

Here's some example data

dat <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16),
                  d = rnorm(16), e = rnorm(16), f = rnorm(16))

            a          b            c          d           e          f
1  -0.9089594 -0.8144765  0.872691548  0.4051094 -0.09705234 -1.5100709
2   0.7993102  0.3243804  0.394560355  0.6646588  0.91033497  2.2504104
3   0.2963102 -0.2911078 -0.243723116  1.0661698 -0.89747522 -0.8455833
4  -0.4311512 -0.5997466 -0.545381175  0.3495578  0.38359390  0.4999425
5  -0.4955802  1.8949285 -0.266580411  1.2773987 -0.79373386 -1.8664651
6   1.0957793 -0.3326867 -1.116623982 -0.8584253  0.83704172  1.8368212
7  -0.2529444  0.5792413 -0.001950741  0.2661068  1.17515099  0.4875377
8   1.2560402  0.1354533  1.440160168 -2.1295397  2.05025701  1.0377283
9   0.8123061  0.4453768  1.598246016  0.7146553 -1.09476532  0.0600665
10  0.1084029 -0.4934862 -0.584671816 -0.8096653  1.54466019 -1.8117459
11 -0.8152812  0.9494620  0.100909570  1.5944528  1.56724269  0.6839954
12  0.3130357  2.6245864  1.750448404 -0.7494403  1.06055267  1.0358267
13  1.1976817 -1.2110708  0.719397607 -0.2690107  0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443  0.34302096 -0.8024803
15  0.2361756  0.6773727  1.279737692  0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335  0.753090683  2.0394865  0.79006103  0.5704210

I'm after something like this

            X1          X2
1  -0.28358147 -0.40067128
2   0.50608365  1.27513471
3  -0.07950691 -0.22562957
4  -0.52542633  0.41103139
5   0.37758930 -0.46093340
6  -0.11784382  0.60514586
7   0.10811540  0.64293184
8   0.94388455  0.31948189
9   0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11  0.07836345  1.28189698
12  1.56269017  0.44897971
13  0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15  0.73109533  0.11872758
16 -0.54599850  1.13332286

which I did with this, but it is obviously no good for my much larger data frame...

data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))

I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.

Ben
  • Is it always every three columns? Are you feeding a vector of vectors of names or a vector of vectors of indices? If user1317221_G's answer isn't what you're after, perhaps you need to give more info. – Tyler Rinker May 19 '12 at 00:52
  • For posterity, the question above appears to be the transpose of this more recent question about applying a function to groups of rows (and has some different approaches): http://stackoverflow.com/q/10837258/1036500 – Ben Jun 07 '12 at 16:41

6 Answers

18

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

This works if you just have column names too:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

EDIT

Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))

EDIT 2

Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though not satisfying, method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
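
A tidier way to build the same index list (a sketch on my part, not from the original answer) is to let split() do the grouping. Note that, unlike the code above, this keeps the trailing incomplete group rather than dropping it:

# Sketch: group column positions with split(); drop = FALSE keeps a
# single-column group from collapsing to a vector before rowMeans()
idx <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / 3))
do.call(cbind, lapply(idx, function(i) rowMeans(dat[, i, drop = FALSE])))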
Tyler Rinker
  • This leaves out the last column because it doesn't have three columns to bind together. – Tyler Rinker May 19 '12 at 01:21
  • yes, your edit does exactly what I'm looking for, thanks very much. Sorry that my question was poorly formed, it was borne of some long hours of fruitless attempts... – Ben May 19 '12 at 01:42
  • I'm going to ask for a better way to create the indices and link back here. – Tyler Rinker May 19 '12 at 02:33
  • Here's a link to that question for future searchers [LINK](http://stackoverflow.com/questions/10661937/make-a-list-of-equal-length-vectors) – Tyler Rinker May 19 '12 at 02:51
  • Some other method for indexes: split(1:n,rep(1:n,each=3,length=n)). Here n is the number of columns. – Wojciech Sobala May 19 '12 at 02:55
  • @WojciechSobala can you post that answer to the link above (though you'll have to remove the last list index as it is not of length 3). – Tyler Rinker May 19 '12 at 03:15
8

A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to the OP or future readers, I am copying that function here, along with an example for the OP's data.

# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
    # Create index list
    if (length(by) == 1)
    {
        nc <- ncol(x)
        split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
    } else # 'by' is a vector of groups
    {
        nc <- length(by)
        split.index <- by
    }
    index.list <- split(seq(from = 1, to = nc), split.index)

    # Pass index list to fun using sapply() and return object
    sapply(index.list, function(i)
            {
                do.call(fun, list(x[, i], ...))
            })
}

Then, to find the mean of the replicates:

byapply(dat, 3, rowMeans)

Or, perhaps the standard deviation of the replicates:

byapply(dat, 3, apply, 1, sd)

Update

by can also be specified as a vector of groups:

byapply(dat, c(1,1,1,2,2,2), rowMeans)
jthetzel
7

Row means for columns a, b and c:

 rowMeans(dat[1:3])

Row means for columns d, e and f:

 rowMeans(dat[4:6])

All in one call:

results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))

If you only know the names of the columns and not their order, you can use:

rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))

# I don't know how much damage this does to speed, but it should still be quick
user1317221_G
  • And what about for a data frame with hundreds of columns? How can you generalize this? – Ben May 19 '12 at 00:34
  • @joran, you're right, I was too hasty in preparing my question, sorry for the ambiguity. Tyler Rinker's edit has the code that does what I'm after. – Ben May 19 '12 at 01:39
5

The rowMeans solution will be faster, but for completeness here's how you might do this with apply:

t(apply(dat, 1, function(x) { c(mean(x[1:3]), mean(x[4:6])) }))
joran
  • How about row means for every consecutive set of three columns in a data frame with several hundred columns? – Ben May 19 '12 at 00:44
  • @Ben Reduce it to a problem you've already solved: (1) transpose (2) use **plyr** or **data.table**, (3) transpose back. (Assuming everything is numeric.) – joran May 19 '12 at 00:58
  • I'll give that a shot and see if I can come up with something more efficient than Tyler's solution above (unlikely, but worth a try!) – Ben May 19 '12 at 01:40
  • thanks for those tips, I've come up with two approaches based on your suggestions (though perhaps not exactly what you had in mind...), see above. – Ben May 19 '12 at 04:47
2

Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):

Make a data frame of example data with p columns to simulate a realistic data set (following @TylerRinker's answer above, and unlike my poor example in the question):

p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))

Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc., or for groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing):

n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))

Now use apply and tapply to get row means for each of the groups

dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))

The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
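
One way around the ordering downside (a sketch on my part, not part of the original answer) is to reorder the result columns numerically after the fact:

# Sketch: data.frame() turns the group names "1", "10", "11", ... into
# X1, X10, X11, ...; strip the X and sort numerically to restore group order
dat.avs <- dat.avs[, order(as.numeric(sub("^X", "", names(dat.avs))))]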

Further to @joran's suggestion, here's a data.table solution:

p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <-  data.frame(t(dat))

n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))

library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
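
To complete step (3) of @joran's recipe (a sketch on my part, not part of the original answer), the result can be transposed back so that rows again correspond to the original rows and columns to the groups:

# Sketch: drop the grouping column, transpose, and re-label the columns;
# note the groups come back in character-sorted order ("1", "10", "11", ...)
dat.av.df <- as.data.frame(dat.av)
res <- t(dat.av.df[, setdiff(names(dat.av.df), "groups")])
colnames(res) <- dat.av.df$groups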

Thanks everyone for your quick and patient efforts!

Ben
  • Just to add a pointer that the `lapply(.SD,mean)` idiom should get much faster in v1.8.1 thanks to: i) a discovery in [this question](http://stackoverflow.com/questions/10584993/r-loop-over-columns-in-data-table) and ii) automatic .Internal()isation of `mean()` (wiki point 3 no longer needed). Also, `.SDcols` is often useful but not needed here. – Matt Dowle May 21 '12 at 16:30
  • @MatthewDowle thanks for your note! Good to know about `.SDcols`, not one I was familiar with, and great to hear `data.table` just keeps getting faster! – Ben May 22 '12 at 01:57
0

There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.

combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))

To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. The operation is concise and reasonably fast compared with writing explicit loops. If the order of the columns matters, then you instead need a permutation algorithm designed to produce ordered sets: combinat::permn
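
As a small follow-up (a sketch on my part, not part of the original answer), the columns of the result can be labelled with the pair of source columns each one averages:

# Sketch: combn() with no FUN returns the 2-row matrix of column-name pairs,
# which can be pasted together to label the columns of the result
pairs <- combn(colnames(dat), 2)
res <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
colnames(res) <- apply(pairs, 2, paste, collapse = "_")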

Adam Erickson
  • what do you mean by "if the order matters" and what is the combinat::permn function? Can you edit the code please? – user3495945 Mar 07 '16 at 08:30
  • Combinations are not the same thing as permutations: https://www.youtube.com/watch?v=s2W6Bce_T30 If the order of inputs matters, then it is the permutation that you seek. In this case, 'order' is in reference to the order of columns. – Adam Erickson Mar 08 '16 at 11:53