I am currently using ddply to apply a function I have written to a data frame. The function evaluates each row based on the values in its columns and then applies a number of other functions to the data in that row. The result is a data frame with the same structure as the input data frame, plus an additional column holding the result of the applied function for each row.
My problem is that the data set is reasonably large, so ddply takes a long time, too long for my purposes.
I have read a number of other SO questions and blog posts on replacements for ddply when speed matters. Most posts recommend either data.table or some combination of functions in the dplyr package with do. While speed is the top priority, I have never used data.table, so ease of use and intuitiveness are also important.
Similarly, while this question was very useful in explaining how to use different dplyr functions in combination with your own function, I also need to pass other objects into my function, and I am unsure how to do that using the answer to that question.
I have created a simplified example below. My question, then, is how to replicate the ddply call below with either dplyr or data.table, given the points above.
First, I set up some data to mimic the structure of the actual data:
library(plyr)        # load plyr before dplyr to avoid masking problems
library(dplyr)
library(data.table)
noObs <- 1e5
dataIn <- data.frame(One = rep(c("J", "K"), noObs/2),
                     Two = rep(c("ID", "BR", "LB", "OZ"), noObs/4),
                     Three = runif(noObs))
secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"), Size = c(300, 500, 250, 400))
A simplified example of my function is below (in practice, the function takes more than two parameters and calls other functions internally):
MyFunction <- function(dataIn, secondaryData){
  groupNames <- c("BR", "LB")
  if(dataIn$One == "J"){
    if(!(dataIn$Two %in% groupNames)){
      if(dataIn$Two == "ID"){
        idx <- match(dataIn$Two, secondaryData$Two)
        value <- secondaryData[idx, "Size"]
        dataIn$newCalc <- dataIn$Three*value
      }else{
        dataIn$newCalc <- dataIn$Three*1000
      }
    }else{
      idx <- match(dataIn$Two, secondaryData$Two)
      value <- secondaryData[idx, "Size"]
      dataIn$newCalc <- dataIn$Three*value+1
    }
  }else{
    idx <- match(dataIn$Two, secondaryData$Two)
    value <- secondaryData[idx, "Size"]
    dataIn$newCalc <- dataIn$Three*value
  }
  return(dataIn)
}
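For reference, the branching in this simplified function reduces to three rules per row, so in principle it can be written without any per-row call. The following is only a minimal base-R sketch under assumptions: the name vectorisedCalc is hypothetical, it assumes every value of Two has a matching row in secondaryData, and it covers only this simplified example, not the other functions my real function calls.

```r
# Hypothetical vectorised equivalent of MyFunction for the simplified example.
# Assumes every value of dataIn$Two has a matching row in secondaryData.
vectorisedCalc <- function(dataIn, secondaryData) {
  groupNames <- c("BR", "LB")
  # Look up Size for all rows at once instead of once per row.
  size <- secondaryData$Size[match(dataIn$Two, secondaryData$Two)]
  isJ <- dataIn$One == "J"
  inGroup <- dataIn$Two %in% groupNames

  # Default rule: Three * Size (covers One != "J" and the J / "ID" case).
  newCalc <- dataIn$Three * size
  # J rows whose Two is in groupNames get Three * Size + 1.
  addOne <- isJ & inGroup
  newCalc[addOne] <- newCalc[addOne] + 1
  # J rows that are neither in groupNames nor "ID" get Three * 1000.
  flat <- isJ & !inGroup & dataIn$Two != "ID"
  newCalc[flat] <- dataIn$Three[flat] * 1000

  dataIn$newCalc <- newCalc
  dataIn
}

# Small illustrative inputs (made up for this sketch).
small <- data.frame(One = c("J", "J", "J", "K"),
                    Two = c("ID", "BR", "OZ", "OZ"),
                    Three = c(0.5, 0.5, 0.5, 0.5),
                    stringsAsFactors = FALSE)
sizes <- data.frame(Two = c("ID", "BR", "LB", "OZ"),
                    Size = c(300, 500, 250, 400),
                    stringsAsFactors = FALSE)
out <- vectorisedCalc(small, sizes)
```

Unlike ddply, this keeps the rows in their original order rather than sorting them by the grouping columns, so any comparison of the two results would need to sort both first.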
The ddply call looked like:
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
Finally, some examples of things I have tried (I have yet to try data.table):
dataIn %>% group_by(names(dataIn)) %>% do(MyFunction(dataIn, secondaryData))
dataIn %>% group_by(names(dataIn)) %>% MyFunction(dataIn, secondaryData)
dataIn %>% group_by(.dots = names(dataIn)) %>% MyFunction(secondaryData)
EDIT
I have found a way with dplyr that works, except that it is even slower than ddply, and I can't figure out how to use group_by with names. This doesn't seem right to me, as dplyr is meant to be faster.
In addition, I have been experimenting with data.table, but haven't been able to get it to work. Again, I am looking for something that runs faster than ddply:
#Plyr
start <- proc.time()
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
plyrTime <- proc.time() - start
#Dplyr
#Works
start <- proc.time()
res <- dataIn %>% group_by(One, Two, Three) %>% do(MyFunction(.,secondaryData))
dplyrTime <- proc.time() - start
#Doesn't work
res <- dataIn %>% group_by(.,names(dataIn)) %>% do(MyFunction(.,secondaryData))
#Data.table
dataInDT <- data.table(dataIn)
dataInDT[,.(MyFunction(.,secondaryData)), by=.(One, Two, Three)]