I am currently using ddply to apply a function I have written to a data frame. The function evaluates each row based on the values in the columns and then applies a number of other functions to the data in that row. The result is a data frame with the same structure as the input data frame and an additional column with the result of the applied function for each row.

My problem is the data set is reasonably large and therefore using ddply takes a long time - too long for the purpose!

I have read a number of other SO questions and blog posts on replacements for ddply when speed matters. Most posts recommend either data.table or some combination of functions from the dplyr package with do. While speed is the top priority, I have never used data.table, so ease of use and intuitiveness are also important.

Similarly, while this question was very useful in explaining how to use different dplyr functions in combination with your own function, I also need to pass other objects into my function, and I am unsure how to do that using the answer to that question.

I have created a simplified example below. My question is how to replicate the ddply call below with either dplyr or data.table, given the points above.

First, I set up some data to mimic the structure of the actual data:

noObs <- 1e5
dataIn <- data.frame(One = rep(c("J", "K"), noObs/2), Two = rep(c("ID", "BR", "LB", "OZ"), noObs/4),
                     Three = runif(noObs))

secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"), Size = c(300, 500, 250, 400))

A simplified example of my function is below (in practice, the function takes more than two parameters and calls other functions internally):

MyFunction <- function(dataIn, secondaryData){

  groupNames <- c("BR", "LB")

  if(dataIn$One == "J"){
    if(!(dataIn$Two%in%groupNames)){
      if(dataIn$Two == "ID"){
        idx <- match(dataIn$Two, secondaryData$Two)
        value <- secondaryData[idx, "Size"]
        dataIn$newCalc <- dataIn$Three*value
      }else{
        dataIn$newCalc <- dataIn$Three*1000
      }
    }else{
      idx <- match(dataIn$Two, secondaryData$Two)
      value <- secondaryData[idx, "Size"]
      dataIn$newCalc <- dataIn$Three*value+1
    }
  }else{
    idx <- match(dataIn$Two, secondaryData$Two)
    value <- secondaryData[idx, "Size"]
    dataIn$newCalc <- dataIn$Three*value
  }

  return(dataIn)

}

The ddply call looked like this:

dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)

Finally, some examples of things I have tried (I have yet to try data.table):

dataIn %>% group_by(names(dataIn)) %>% do(MyFunction(dataIn, secondaryData))
dataIn %>% group_by(names(dataIn)) %>% MyFunction(dataIn, secondaryData)
dataIn %>% group_by(.dots = names(dataIn)) %>% MyFunction(secondaryData)

EDIT

I have found a way with dplyr that works, except that it is even slower than ddply, and I can't figure out how to use group_by with names(dataIn). This doesn't seem right to me, as dplyr is meant to be faster.

In addition, I have been experimenting with data.table, but haven't been able to get it to work. Again, I am looking for something that runs faster than ddply.

#Plyr
start <- proc.time()
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
plyrTime <- proc.time() - start

#Dplyr
#Works
start <- proc.time()
res <- dataIn %>% group_by(One, Two, Three) %>% do(MyFunction(.,secondaryData))
dplyrTime <- proc.time() - start
#Doesn't work
res <- dataIn %>% group_by(.,names(dataIn)) %>% do(MyFunction(.,secondaryData))

#Data.table
dataInDT <- data.table(dataIn)
dataInDT[,.(MyFunction(.,secondaryData)), by=.(One, Two, Three)] 
Celeste
  • Maybe `library(dplyr); dataIn %>% group_by_(.dots = names(dataIn)) %>% myFunction(projSettings, secondaryData)`. Please provide a reproducible example. – lukeA Nov 23 '15 at 09:06
  • SO is not a codewriting service. Show us what you've already tried (including a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610)). – Jaap Nov 23 '15 at 09:23
  • @lukeA I have added a reproducible example. Hope this helps – Celeste Nov 25 '15 at 05:39
  • Thanks. What was wrong with the result of `dataIn %>% group_by_(.dots = names(dataIn)) %>% MyFunction(secondaryData)`? I can't see that you tried that. – lukeA Nov 25 '15 at 08:40
  • @lukeA thank you for the comment. I did try that except it produces a warning. Upon further investigation you can see that the suggested method doesn't properly apply the algorithm, in other words it just takes the first element – Celeste Nov 25 '15 at 22:06

1 Answer

I found a solution using data.table. It performs the correct calculation for each row, and does so much faster than ddply. The function is written differently to suit data.table's style. I'm sure there is a better or more idiomatic way to solve this with data.table, but the solution below works well.

dataInDT <- data.table(dataIn)

groupNames <- c("BR", "LB")
start <- proc.time()
dataInDT[, NewCalc := {
  if(One == "J"){
    if(!(Two%in%groupNames)){
      if(Two == "ID"){
        Three*secondaryData[match(Two, secondaryData$Two), "Size"]
      }else{
        Three*1000
      }
    }else{
      Three*secondaryData[match(Two, secondaryData$Two), "Size"]+1
    }
  }else{
    Three*secondaryData[match(Two, secondaryData$Two), "Size"]
  }}, by=.(One, Two, Three)]
datTableTime <- proc.time() - start
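Because `Three` is a continuous random column, grouping by `One`, `Two` and `Three` puts almost every row in its own group, so the expression above is still evaluated roughly once per row. A possible variant, offered only as a sketch, drops `by` and updates subsets of rows instead. It assumes the simplified rules from the question's `MyFunction`; the temporary `Size` column and the `set.seed` call are additions for illustration, not part of the original code.

```r
library(data.table)

set.seed(1)  # assumption: added only so the sketch is reproducible
noObs <- 1e5
dataInDT <- data.table(One = rep(c("J", "K"), noObs / 2),
                       Two = rep(c("ID", "BR", "LB", "OZ"), noObs / 4),
                       Three = runif(noObs))
secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"),
                            Size = c(300, 500, 250, 400))
groupNames <- c("BR", "LB")

# Look up Size for every row in one vectorised match (temporary helper column)
dataInDT[, Size := secondaryData$Size[match(Two, secondaryData$Two)]]

# Default rule: Three * Size (the One != "J" branch and the J/"ID" branch)
dataInDT[, NewCalc := Three * Size]

# One == "J" and Two in groupNames: add 1
dataInDT[One == "J" & Two %in% groupNames, NewCalc := Three * Size + 1]

# One == "J", Two outside groupNames and not "ID": fixed multiplier
dataInDT[One == "J" & !(Two %in% groupNames) & Two != "ID",
         NewCalc := Three * 1000]

# Drop the helper column again
dataInDT[, Size := NULL]
```

Each rule is applied to all matching rows in one step, so the number of evaluations depends on the number of rules rather than the number of rows.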

Comparing this to the old solution, you can see that the speed is greatly improved:

start <- proc.time()
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
plyrTime <- proc.time() - start
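For timing comparisons like this, `system.time()` wraps the start/stop bookkeeping in one call. The sketch below uses it on a dependency-free base R version of the same rules, which can serve as a baseline when comparing packages. `vectorisedCalc` is a hypothetical helper name, and the rules are assumed to be the simplified ones from the question.

```r
set.seed(42)  # assumption: added only so the sketch is reproducible
noObs <- 1e5
dataIn <- data.frame(One = rep(c("J", "K"), noObs / 2),
                     Two = rep(c("ID", "BR", "LB", "OZ"), noObs / 4),
                     Three = runif(noObs),
                     stringsAsFactors = FALSE)
secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"),
                            Size = c(300, 500, 250, 400),
                            stringsAsFactors = FALSE)
groupNames <- c("BR", "LB")

# Hypothetical helper: the question's branching logic applied to whole columns
vectorisedCalc <- function(d, secondaryData, groupNames) {
  size    <- secondaryData$Size[match(d$Two, secondaryData$Two)]
  isJ     <- d$One == "J"
  inGroup <- d$Two %in% groupNames
  ifelse(isJ & inGroup, d$Three * size + 1,
         ifelse(isJ & d$Two != "ID", d$Three * 1000, d$Three * size))
}

# system.time() evaluates the expression and returns user/system/elapsed times
baseTime <- system.time(
  dataIn$newCalc <- vectorisedCalc(dataIn, secondaryData, groupNames)
)
print(baseTime[["elapsed"]])
```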

Of course, in practice the data.table function I used was even more intricate, in particular the by section was much longer.

I was unable to find a solution using dplyr and am still curious to know how it would work.
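One way this might look in dplyr, as a sketch rather than a definitive answer, is to join the lookup table once and express the branches with `case_when()` inside `mutate()`, with no grouping at all. The wrapper name `addNewCalc` is hypothetical. It shows that extra objects such as `secondaryData` can be passed as ordinary arguments after the piped data frame. The rules are assumed to be the simplified ones from the question.

```r
library(dplyr)

set.seed(1)  # assumption: added only so the sketch is reproducible
noObs <- 1e5
dataIn <- data.frame(One = rep(c("J", "K"), noObs / 2),
                     Two = rep(c("ID", "BR", "LB", "OZ"), noObs / 4),
                     Three = runif(noObs),
                     stringsAsFactors = FALSE)
secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"),
                            Size = c(300, 500, 250, 400),
                            stringsAsFactors = FALSE)
groupNames <- c("BR", "LB")

# Hypothetical wrapper: data frame first, extra objects as further arguments
addNewCalc <- function(data, secondaryData, groupNames) {
  data %>%
    left_join(secondaryData, by = "Two") %>%   # attach Size to every row
    mutate(newCalc = case_when(
      One == "J" & Two %in% groupNames ~ Three * Size + 1,
      One == "J" & Two != "ID"         ~ Three * 1000,
      TRUE                             ~ Three * Size
    )) %>%
    select(-Size)                              # drop the helper column
}

res <- dataIn %>% addNewCalc(secondaryData, groupNames)
```

`case_when()` checks its conditions in order, so the second rule only reaches rows where `Two` is outside `groupNames`, which matches the nested `if` structure of the original function.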

Celeste