
I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), na.rm=TRUE)

According to system.time(), it takes about this long to run:

   user  system elapsed 
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?

dnagirl
  • `ddply()` is first and foremost *convenient*. If you need something fast you may need to reimplement the logic. – Dirk Eddelbuettel Oct 19 '10 at 19:03
  • @Shane: There are currently 3*400 possible data sets (and increasing daily) that a user could request. It is unlikely that one user would hit on the same data set as another. So caching would only be useful inside a session. Since the output of the webapp is essentially a canned report, I don't think the user would usually request it more than once. Would you implement caching for the situation I've described? I've never done it before, so I'm at a bit of a loss. – dnagirl Oct 19 '10 at 19:17
  • @Dirk Eddelbuettel: do you mean "create df.median a different way" or "find a way that doesn't require df.median"? – dnagirl Oct 19 '10 at 19:22
  • No, I wouldn't cache in that case. Do you have an "acceptable" amount of time? There are many ways to optimize this, but there will be a speed vs. "ease of programming" trade-off as you need lower latency. – Shane Oct 19 '10 at 19:23
  • @dnagirl What @Dirk means is that `plyr` wasn't designed primarily for performance, but for ease of use. As an example, `llply` (which underlies most of the other plyr functions) is several times slower than `lapply`, even though the core functionality of both functions is the same. – Shane Oct 19 '10 at 19:25
  • @dnagirl, see also this related question: http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations – JD Long Oct 19 '10 at 19:29
  • @Shane: Ah! I understand. I guess I will have to fiddle about with the base functions then. As for acceptable time, well... currently the entire script takes about 4 **minutes** to run. I'd like to get it under a minute. None of the steps appear appreciably worse than the one I've shown above. I'm not really sure why, combined, they take so long. – dnagirl Oct 19 '10 at 19:34
  • @dnagirl - `require(fortunes); fortune("dog")` and substitute "data" :-) Also, for future reference, use a different extension than `.R` for a `save()`ed R object. `.rda` is commonly used in R packages. `.R` usually means an R script. I spent a few minutes trying to figure out what `data.R` was before it dawned on me – Gavin Simpson Oct 19 '10 at 20:24
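
A minimal illustration of the save()/.rda convention Gavin mentions above (the object and file names here are just placeholders):

# save() an R object to a .rda file; a .R extension usually means an R script
save(data, file = "data.rda")
# later, load() restores the object into the workspace under its original name
load("data.rda")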

6 Answers


Just using aggregate is quite a bit faster...

> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
> 
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   1.89    0.00    1.89 
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
   user  system elapsed 
   5.06    0.00    5.06 
> 
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
> 
> identical(ag.median, df.median)
[1] TRUE
Joshua Ulrich

Just to summarize some of the points from the comments:

  1. Before you start to optimize, you should have some sense of "acceptable" performance. Depending upon the required performance, you can then be more specific about how to improve the code. For instance, at some threshold, you would need to stop using R and move on to a compiled language.
  2. Once you have an expected run-time, you can profile your existing code to find potential bottlenecks. R has several mechanisms for this, including Rprof (there are examples on stackoverflow if you search for [r] + rprof); a minimal sketch follows this list.
  3. plyr is designed primarily for ease-of-use, not for performance (although the recent version had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
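
For point 2, a minimal Rprof sketch, assuming the data and the ddply() call from the question (the output file name is a placeholder):

library(plyr)
Rprof("profile.out")                      # start collecting profiling samples
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
Rprof(NULL)                               # stop profiling
summaryRprof("profile.out")               # report where the time was spent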
Shane
  • Thanks for the summary. And thanks to everyone who contributed such useful information. I have a lot of reading to do! – dnagirl Oct 20 '10 at 19:03

The order of the data matters when you are calculating medians: if the data are sorted from smallest to largest, the calculation is a bit quicker.

x <- 1:1e6
y <- sample(x)
system.time(for(i in 1:1e2) median(x))
   user  system elapsed 
   3.47    0.33    3.80

system.time(for(i in 1:1e2) median(y))
   user  system elapsed 
   5.03    0.26    5.29

For new datasets, sort the data by an appropriate column when you import them. For existing datasets, sort them as a batch job (outside the web app).
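
A minimal sketch of that batch-job idea, assuming you sort on one of the measurement columns (inadist is one of the numeric columns mentioned elsewhere in this thread; the sorted object name is a placeholder):

# one-off batch job, outside the web app: order the data by a measurement column
# so that later median() calls see values that are already in ascending order
data_sorted <- data[order(data$inadist), ]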

Richie Cotton

Working with this data is considerably faster with dplyr:

library(dplyr)

system.time({
  data %>% 
    group_by(groupname, starttime, fPhase, fCycle) %>%
    summarise_each(funs(median(., na.rm = TRUE)), inadist:larct)
})
#>    user  system elapsed 
#>   0.391   0.004   0.395

(You'll need dplyr 0.2 to get %>% and summarise_each)

This compares favourably to plyr:

library(plyr)
system.time({
  df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle), 
    numcolwise(median), na.rm = TRUE)
})
#>    user  system elapsed 
#>   0.991   0.004   0.996

And to aggregate() (code from @joshua-ulrich):

groupVars <- c("groupname", "starttime", "fPhase", "fCycle")
dataVars <- colnames(data)[ !(colnames(data) %in% c("location", groupVars))]
system.time({
  ag.median <- aggregate(data[,dataVars], data[,groupVars], median)
})
#>    user  system elapsed 
#>   0.532   0.005   0.537
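
(In dplyr versions released after this answer was written, summarise_each was superseded by across(); a rough equivalent of the call above, assuming the same inadist:larct column range, would be:)

library(dplyr)
data %>%
  group_by(groupname, starttime, fPhase, fCycle) %>%
  # median of every column in the inadist:larct range, per group
  summarise(across(inadist:larct, ~ median(.x, na.rm = TRUE)))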
hadley

To add to Joshua's solution: if you decide to use mean instead of median, you can speed up the computation by roughly another factor of 4:

> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   3.472   0.020   3.615 
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
   user  system elapsed 
   0.936   0.008   1.006 
VitoshKa

I just did a few simple transformations on a large data frame (the baseball data set in the plyr package) using the standard library functions ('table', 'tapply', 'aggregate', etc.) and the analogous plyr functions; in each instance, I found plyr to be significantly slower. For example:

> system.time(table(BB$year))
    user  system elapsed 
   0.007   0.002   0.009 

> system.time(ddply(BB, .(year), 'nrow'))
    user  system elapsed 
   0.183   0.005   0.189 

Second, I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table package, available on CRAN. Creating data.table objects is simple, as is converting existing data.frames to data.tables: just call data.table() on the data.frame you want to convert:

dt1 = data.table(my_dataframe)
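
A sketch of the grouped-median step with data.table, assuming the grouping columns from the question and restricting .SD to numeric columns (num_cols and dt.median are placeholder names; this is not a benchmarked drop-in):

library(data.table)
dt1 <- data.table(my_dataframe)
# restrict .SD to numeric columns so median() is never applied to e.g. 'location'
num_cols <- names(dt1)[sapply(dt1, is.numeric)]
dt.median <- dt1[, lapply(.SD, median, na.rm = TRUE),
                 by = list(groupname, starttime, fPhase, fCycle),
                 .SDcols = num_cols]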
doug