1

A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in some subset of other observations.

To make this more concrete, suppose I have a number of workers (indexed by WorkerId), each with an associated "Iteration":

    raw <- data.frame(WorkerId=c(1,1,1,1,2,2,2,2,3,3,3,3),
              Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

and I want to eventually subset the data frame to exclude the "last" iteration (by creating a "remove" boolean) for each worker. I can write a function to do this:

    raw$remove <- mapply(function(wid, iter) {
                             iter == max(raw$Iteration[raw$WorkerId == wid])},
                         raw$WorkerId, raw$Iteration)

    > raw$remove
     [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

but this gets very slow as the data frame gets larger (presumably because I'm needlessly computing the max for every observation).

My question is: what is the more efficient (and idiomatic) way of doing this in a functional programming style? Is it to first create a dictionary mapping each WorkerId to its max value, and then pass that as a parameter to another function that operates on each observation?
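A minimal sketch of that dictionary idea, assuming `tapply()` is used to build the lookup (the names `max_by_worker` and `row_max` are illustrative and not part of the original code). Each group's maximum is computed once, and every row then looks up its own group's value by name:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Build the "dictionary": one max per worker, named by WorkerId
max_by_worker <- tapply(raw$Iteration, raw$WorkerId, max)

# Look up each row's group maximum by name; as.vector() drops names and dims
row_max <- as.vector(max_by_worker[as.character(raw$WorkerId)])

# Vectorized comparison: TRUE where the row holds its worker's max Iteration
raw$remove <- raw$Iteration == row_max
```

This does one pass to compute the maxima and one vectorized comparison, instead of recomputing the max for every row.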

John Horton
  • Your example is covered by another r-question: [Extracting indices for data frame rows that have MAX value for named field](http://stackoverflow.com/q/6025051/168747). – Marek May 29 '11 at 23:08

5 Answers

3

The "most natural way" IMO is the split-lapply-rbind method. You start by split()-ting into a list of groups, then lapply() the processing rule (in this case removing the last row) and then rbind() them back together. It's all doable as a nested set of function calls. The inner two steps are illustrated here and the final one-liner is presented at the bottom:

    > lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] )
    $`1`
      WorkerId Iteration
    1        1         1
    2        1         2
    3        1         3

    $`2`
      WorkerId Iteration
    5        2         1
    6        2         2
    7        2         3

    $`3`
       WorkerId Iteration
    9         3         1
    10        3         2
    11        3         3

    do.call(rbind,  lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] ) )

Hadley Wickham has developed a wide set of tools, the plyr package, that extend this strategy to a wider variety of tasks.

IRTFM
  • Better to split an index and subset? `splt <- split(seq_len(nrow(raw)), raw$WorkerId); idx <- unlist(lapply(splt, function(x) x[-length(x)]), use.names=FALSE); raw[idx,]` – Martin Morgan May 29 '11 at 13:31
  • Martin, when you say so, I am not one to argue. I assume you suggest this as an efficiency enhancement, because it would not be working on the full dataframe? – IRTFM May 29 '11 at 13:47
  • Yes, working with just the data needed. Both `split` and `rbind` on a data frame will be expensive relative to `split` on a vector and subset. I like how your answer illustrates a useful pattern. – Martin Morgan May 29 '11 at 13:53
  • Thanks - this was really helpful (as was the pointer to plyr). – John Horton May 29 '11 at 13:59
3

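For the specific problem posed, use `!rev(duplicated(rev(raw$WorkerId)))` or better, following Charles' advice, `!duplicated(raw$WorkerId, fromLast=TRUE)`. Both are TRUE on the last row of each worker, which is the "remove" vector when the rows are ordered by Iteration within each worker.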
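As a hedged worked example on the question's data (the explicit sort and the name `kept` are additions for illustration): `duplicated(..., fromLast = TRUE)` scans from the end, so the last row of each worker is the only one not flagged as a duplicate.

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Assumption: sort first so the last row per worker is also its max Iteration
raw <- raw[order(raw$WorkerId, raw$Iteration), ]

# TRUE only on the final row seen for each WorkerId
raw$remove <- !duplicated(raw$WorkerId, fromLast = TRUE)

# Drop the flagged rows
kept <- raw[!raw$remove, ]
```

If the data are not guaranteed to be ordered this way, the sort step matters: without it the flag marks the last row as stored, not the row with the largest Iteration.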

Martin Morgan
  • See also the `fromLast` argument to avoid reversing twice. – Charles May 30 '11 at 14:49
  • Brilliant. I stole it to [update one of my answers](http://stackoverflow.com/questions/6025051/extracting-indices-for-data-frame-rows-that-have-max-value-for-named-field/6037559#6037559). – Marek May 31 '11 at 08:22
2

This situation is tailor-made for using the plyr package.

    library(plyr)
    ddply(raw, .(WorkerId), function(df) df[-NROW(df),])

It produces the output

      WorkerId Iteration
    1        1         1
    2        1         2
    3        1         3
    4        2         1
    5        2         2
    6        2         3
    7        3         1
    8        3         2
    9        3         3
Ramnath
2
    subset(raw, Iteration != ave(Iteration, WorkerId, FUN=max))
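To unpack the one-liner, here is a small sketch of the intermediate steps (the names `group_max` and `kept` are illustrative). `ave()` returns a vector the same length as its input, with each element replaced by `FUN` applied to that element's group, so the per-worker maximum is already aligned row by row:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Per-row copy of each worker's maximum Iteration
group_max <- ave(raw$Iteration, raw$WorkerId, FUN = max)

# The "remove" flag from the question, computed without a per-row max()
raw$remove <- raw$Iteration == group_max

# Keep everything except each worker's last iteration
kept <- raw[!raw$remove, ]
```

Unlike approaches that drop the last row of each group, this compares values, so it does not depend on the row order.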
Charles
1
    # Flag the last row of each worker (assumes rows are ordered by Iteration within WorkerId)
    remove <- with(raw, as.logical(ave(Iteration, WorkerId, FUN=function(x) c(rep(FALSE, length(x)-1), TRUE))))
Eduardo Leoni