
I understand what tapply() does in R. However, I cannot parse this description of it from the documentation:


Apply a Function Over a "Ragged" Array

Description:

     Apply a function to each cell of a ragged array, that is to each
     (non-empty) group of values given by a unique combination of the
     levels of certain factors.

Usage:

     tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

When I think of tapply, I think of GROUP BY in SQL. You group the values in X by the parallel factor levels in INDEX and apply FUN to each group. I have read the description of tapply 100 times and still can't figure out how what it says maps to how I understand tapply. Perhaps someone can help me parse it?

frankc

2 Answers


@joran's great answer helped me understand it (so please vote for his - I would have added it as comment if it wasn't too long for that), but this may be of help to some:

In quite a few languages, you have two-dimensional arrays. Depending on the language, these arrays either have fixed dimensions (i.e. each row has the same number of columns) or allow the number of items per row to differ. So instead of:

A: 1  2  3
B: 4  5  6
C: 7  8  9

You could get something like

A: 1  3
B: 4  5  6
C: 8

This is called a ragged array because, well, the right side of it looks ragged. In typical R-style, we might represent this as two vectors:

values<-c(1,3,4,5,6,8)
names<-c("A", "A", "B", "B", "B", "C")

So tapply with these two vectors as its first two arguments indeed allows us to apply a function to each 'row' of our ragged array.
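To make that concrete, here is a minimal sketch using the two vectors above. The choice of `sum` as the function and the variable names `rows` and `row_sums` are mine, purely for illustration:

```r
# The two parallel vectors from above
values <- c(1, 3, 4, 5, 6, 8)
names  <- c("A", "A", "B", "B", "B", "C")

# The ragged "rows": one list element per label, of differing lengths
rows <- split(values, names)
print(rows)

# Apply a function to each "row"; sum is just an illustrative choice
row_sums <- tapply(values, names, sum)
print(row_sums)
```

Here `rows` has 2, 3 and 1 elements for A, B and C, and `row_sums` is 4, 15 and 8 respectively.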

Nick Sabbe
  • 2
    +1 Nice. I think what confuses people (including me when I first read the `tapply` docs) is that I immediately think if I had it in a 'ragged' form, why wouldn't I just store it in a `list(A=c(1,3),B=c(4,5,6),C=8)` and use `lapply`? The key for me was to realize it's useful when you've organized your data in a 'long' form, like by cbind-ing `values` and `names`. – joran Jun 09 '11 at 22:46
  • 3
    @joran : if you do `split(values,names)` you get the ragged array. If you look at the source code of `tapply`, you see that it does exactly that, and then uses `sapply` over the obtained list. – Joris Meys Jun 09 '11 at 23:12
  • 1
    @Joris - Yeah; most of the confusion comes (IMHO) from the wording of the documentation. Once you look at the source, it's fairly clear what's going on. But the first time I read that I sure as heck was left scratching my head as to what they meant by a ragged array. – joran Jun 09 '11 at 23:22

Let's see what the R documentation says on the subject:

The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.

The list of factors you supply via INDEX together specify a collection of subsets of X, of possibly different lengths (hence, the 'ragged' descriptor). And then FUN is applied to each subset.

EDIT: @Joris makes an excellent point in the comments. It may be helpful to think of tapply(X,Y,...) as a wrapper for sapply(split(X,Y),...) in that if Y is a list of grouping factors, it builds a new, single grouping factor based on their unique levels, splits X accordingly and applies FUN to each piece.
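A rough sketch of that equivalence, using small made-up vectors (`x`, `year` and `site` are hypothetical data, not from the question). This only mirrors the idea; the actual internals of tapply differ in detail, for example in how the result is shaped:

```r
# Hypothetical data: four observations, two grouping variables
x    <- c(10, 20, 30, 40)
year <- c("2001", "2001", "2002", "2002")
site <- c("a", "b", "a", "a")

# One grouping factor: both approaches give the same group means
by_year_tapply <- tapply(x, year, mean)
by_year_split  <- sapply(split(x, year), mean)

# Two grouping factors: tapply returns a year-by-site matrix,
# with NA for the empty (2002, b) cell
cells_tapply <- tapply(x, list(year, site), mean)

# split() combines the factors into one (via interaction());
# drop = TRUE keeps only the non-empty combinations
cells_split <- sapply(split(x, list(year, site), drop = TRUE), mean)

print(cells_tapply)
print(cells_split)
```

The main visible difference is the shape of the result: tapply lays the cells out as an array indexed by the factor levels, while the split/sapply version gives a flat named vector with names like `2002.a`.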

EDIT: Here's an illustrative example:

library(lattice)   # provides the barley data set
library(plyr)      # provides ddply()
library(reshape2)  # provides melt(), which plyr does not export
set.seed(123)

# Make this example unbalanced
dat <- barley[sample(1:120, 50), ]

# Suppose we want the average yield by year/site:
table(dat$year, dat$site)

# That's what they mean by a 'ragged' array: there are different
# numbers of observations at each combination of levels

# In plyr we could use ddply:
ddply(dat, .(year, site), .fun = function(x) { mean(x$yield) })

# Which gives the same result (listed in a different order) as:
melt(tapply(dat$yield, list(dat$year, dat$site), mean))
joran
  • 6
    It may be nice to note that `tapply(X,Y,...)` is in essence nothing more than a wrapper for `sapply( split(X,Y), ...)` which illustrates the ragged array rather clearly. – Joris Meys Jun 09 '11 at 22:44