
I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), na.rm=TRUE)

According to system.time(), it takes about this long to run:

   user  system elapsed 
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?

dnagirl
  • `ddply()` is first and foremost *convenient*. If you need something fast you may need to reimplement the logic. – Dirk Eddelbuettel Oct 19 '10 at 19:03
  • @Shane: There are currently 3*400 possible data sets (and increasing daily) that a user could request. It is unlikely that one user would hit on the same data set as another. So caching would only be useful inside a session. Since the output of the webapp is essentially a canned report, I don't think the user would usually request it more than once. Would you implement caching for the situation I've described? I've never done it before, so I'm at a bit of a loss. – dnagirl Oct 19 '10 at 19:17
  • @Dirk Eddelbuettel: do you mean "create df.median a different way" or "find a way that doesn't require df.median"? – dnagirl Oct 19 '10 at 19:22
  • No, I wouldn't cache in that case. Do you have an "acceptable" amount of time? There are many ways to optimize this, but there will be a speed vs. "ease of programming" trade-off as you need lower latency. – Shane Oct 19 '10 at 19:23
  • @dnagirl What @Dirk means is that `plyr` wasn't designed primarily for performance, but for ease of use. As an example, `llply` (which underlies most of the other plyr functions) is several times slower than `lapply`, even though the core functionality of both functions is the same. – Shane Oct 19 '10 at 19:25
  • @dnagirl, see also this related question: http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations – JD Long Oct 19 '10 at 19:29
  • @Shane: Ah! I understand. I guess I will have to fiddle about with the base functions then. As for acceptable time, well... currently the entire script takes about 4 **minutes** to run. I'd like to get it under a minute. None of the steps appear appreciably worse than the one I've shown above. I'm not really sure why, combined, they take so long. – dnagirl Oct 19 '10 at 19:34
  • @dnagirl - `require(fortunes); fortune("dog")` and substitute "data" :-) Also, for future reference, use a different extension than `.R` for a `save()`ed R object. `.rda` is commonly used in R packages. `.R` usually means an R script. I spent a few minutes trying to figure out what `data.R` was before it dawned on me – Gavin Simpson Oct 19 '10 at 20:24
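
A minimal illustration of the save()/.rda convention Gavin mentions above (the object and file names here are just placeholders):

# save() an R object to a .rda file; a .R extension usually means an R script
save(data, file = "data.rda")
# later, load() restores the object into the workspace under its original name
load("data.rda")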

6 Answers


Just using aggregate is quite a bit faster...

> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
> 
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   1.89    0.00    1.89 
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
   user  system elapsed 
   5.06    0.00    5.06 
> 
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
> 
> identical(ag.median, df.median)
[1] TRUE
Joshua Ulrich

Just to summarize some of the points from the comments:

  1. Before you start to optimize, you should have some sense of "acceptable" performance. Depending upon the required performance, you can then be more specific about how to improve the code. For instance, at some threshold, you would need to stop using R and move on to a compiled language.
  2. Once you have an expected run-time, you can profile your existing code to find potential bottlenecks. R has several mechanisms for this, including Rprof (there are examples on stackoverflow if you search for [r] + rprof); a minimal sketch follows this list.
  3. plyr is designed primarily for ease-of-use, not for performance (although the recent version had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
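
For point 2, a minimal Rprof sketch, assuming the data and the ddply() call from the question (the output file name is a placeholder):

library(plyr)
Rprof("profile.out")                      # start collecting profiling samples
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
Rprof(NULL)                               # stop profiling
summaryRprof("profile.out")               # report where the time was spent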
Shane
  • Thanks for the summary. And thanks to everyone who contributed such useful information. I have a lot of reading to do! – dnagirl Oct 20 '10 at 19:03

The order of the data matters when you are calculating medians: if the data are sorted from smallest to largest, the calculation is a bit quicker.

x <- 1:1e6
y <- sample(x)
system.time(for(i in 1:1e2) median(x))
   user  system elapsed 
   3.47    0.33    3.80

system.time(for(i in 1:1e2) median(y))
   user  system elapsed 
   5.03    0.26    5.29

For new datasets, sort the data by an appropriate column when you import them. For existing datasets, sort them as a batch job (outside the web app).
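
A minimal sketch of that batch-job idea, assuming you sort on one of the measurement columns (inadist is one of the numeric columns mentioned elsewhere in this thread; the sorted object name is a placeholder):

# one-off batch job, outside the web app: order the data by a measurement column
# so that later median() calls see values that are already in ascending order
data_sorted <- data[order(data$inadist), ]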

Richie Cotton

Working with this data is considerably faster with dplyr:

library(dplyr)

system.time({
  data %>% 
    group_by(groupname, starttime, fPhase, fCycle) %>%
    summarise_each(funs(median(., na.rm = TRUE)), inadist:larct)
})
#>    user  system elapsed 
#>   0.391   0.004   0.395

(You'll need dplyr 0.2 to get %>% and summarise_each)

This compares favourably to plyr:

library(plyr)
system.time({
  df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle), 
    numcolwise(median), na.rm = TRUE)
})
#>    user  system elapsed 
#>   0.991   0.004   0.996

And to aggregate() (code from @joshua-ulrich):

groupVars <- c("groupname", "starttime", "fPhase", "fCycle")
dataVars <- colnames(data)[ !(colnames(data) %in% c("location", groupVars))]
system.time({
  ag.median <- aggregate(data[,dataVars], data[,groupVars], median)
})
#>    user  system elapsed 
#>   0.532   0.005   0.537
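
(In dplyr versions released after this answer was written, summarise_each was superseded by across(); a rough equivalent of the call above, assuming the same inadist:larct column range, would be:)

library(dplyr)
data %>%
  group_by(groupname, starttime, fPhase, fCycle) %>%
  # median of every column in the inadist:larct range, per group
  summarise(across(inadist:larct, ~ median(.x, na.rm = TRUE)))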
hadley

To add to Joshua's solution: if you decide to use mean instead of median, you can speed up the computation by roughly another factor of 4:

> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   3.472   0.020   3.615 
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
   user  system elapsed 
   0.936   0.008   1.006 
VitoshKa

I just did a few simple transformations on a large data frame (the baseball data set in the plyr package) using the standard library functions ('table', 'tapply', 'aggregate', etc.) and the analogous plyr functions; in each instance, I found plyr to be significantly slower. For example:

> system.time(table(BB$year))
    user  system elapsed 
   0.007   0.002   0.009 

> system.time(ddply(BB, .(year), 'nrow'))
    user  system elapsed 
   0.183   0.005   0.189 

Second, I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table package, available on CRAN. Creating data.table objects is simple, as is converting existing data.frames to data.tables: just call data.table() on the data.frame you want to convert:

dt1 = data.table(my_dataframe)
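
A sketch of the grouped-median step with data.table, assuming the grouping columns from the question and restricting .SD to numeric columns (num_cols and dt.median are placeholder names; this is not a benchmarked drop-in):

library(data.table)
dt1 <- data.table(my_dataframe)
# restrict .SD to numeric columns so median() is never applied to e.g. 'location'
num_cols <- names(dt1)[sapply(dt1, is.numeric)]
dt.median <- dt1[, lapply(.SD, median, na.rm = TRUE),
                 by = list(groupname, starttime, fPhase, fCycle),
                 .SDcols = num_cols]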
doug