A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in a subset of the other observations.
To make this more concrete, suppose I have a number of workers (indexed by WorkerId), each with an associated "Iteration":
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))
and I want to eventually subset the data frame to exclude the "last" iteration (by creating a "remove" boolean) for each worker. I can write a function to do this:
# For each row, compare its Iteration with the max Iteration of that row's worker
raw$remove <- mapply(function(wid, iter) {
    iter == max(raw$Iteration[raw$WorkerId == wid])
  },
  raw$WorkerId, raw$Iteration)
> raw$remove
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
but this gets very slow as the data frame gets larger (presumably because I'm needlessly computing the max for every observation).
My question is: what's the more efficient (and idiomatic) way of doing this in the functional programming style? Is it to first create a WorkerId-to-max-value dictionary and then pass that as a parameter to another function that operates on each observation?
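For concreteness, here is a minimal sketch of the "dictionary" idea I have in mind, assuming base R's tapply is a reasonable way to build the lookup. The name max_by_worker is just something I made up for illustration, and I'm not claiming this is the idiomatic answer:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Build the lookup once: one max Iteration per worker, named by WorkerId.
# (max_by_worker is a hypothetical name, not from any library.)
max_by_worker <- tapply(raw$Iteration, raw$WorkerId, max)

# Look up each row's worker max by name, then compare in one vectorized step.
# as.vector() drops the names/dim attributes that tapply's result carries.
row_max <- as.vector(max_by_worker[as.character(raw$WorkerId)])
raw$remove <- raw$Iteration == row_max

# Possible shortcut, assuming ave() suits this: it returns the group-wise
# max already aligned to the rows, so no explicit lookup is needed.
remove_ave <- raw$Iteration == ave(raw$Iteration, raw$WorkerId, FUN = max)
```

Either way the max is computed once per worker rather than once per row, which is the part I suspect is hurting the mapply version.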