Converting package using S3 to S4 classes, is there going to be performance drop?

Question

I have an R package which currently uses the S3 class system, with two different classes and several methods for generic S3 functions like plot, logLik and update (for model formula updating). As my code has become more complex with all the validity checking and if/else structures, due to the fact that there's no inheritance or dispatching based on two arguments in S3, I have started to think of converting my package to S4. But then I started to read about the advantages and disadvantages of S3 versus S4, and I'm not so sure anymore. I found an R-bloggers blog post about efficiency issues in S3 vs S4, and as that was 5 years ago, I tested the same thing now:

library(microbenchmark)
setClass("MyClass", representation(x="numeric"))
microbenchmark(structure(list(x=rep(1, 10^7)), class="MyS3Class"),
               new("MyClass", x=rep(1, 10^7)) )
Unit: milliseconds
                                                   expr
 structure(list(x = rep(1, 10^7)), class = "MyS3Class")
                       new("MyClass", x = rep(1, 10^7))
       min       lq   median       uq      max neval
 148.75049 152.3811 155.2263 159.8090 323.5678   100
  75.15198 123.4804 129.6588 131.5031 241.8913   100

So in this simple example, S4 was actually a bit faster. Then I read an SO question about using S3 vs S4, which was very much in favor of S3. Especially @joshua-ulrich's answer made me doubtful about S4, as it said that

any slot change requires a full object copy

That feels like a big issue if I consider my case, where I'm updating my object in every iteration when optimizing the log-likelihood of my model. After some googling I found John Chambers' post about this issue, which seems to be changing in R 3.0.0.

So although I feel it would be beneficial to use S4 classes for some clarity in my code (for example more classes inheriting from the main model class), and for the validity checks etc., I am now wondering whether it is worth all the work in terms of performance. So, performance-wise, are there real differences between S3 and S4? Are there other performance issues I should be considering? Or is it even possible to say something about this issue in general?
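
For reference, the S4 features mentioned above (classes inheriting from a main model class, validity checks, and dispatch on two arguments) could look roughly like the following minimal sketch. The class names BaseModel and GaussModel and the generic combine are hypothetical, chosen only for illustration; they are not taken from the package in question.

```r
library(methods)

# Hypothetical parent class with a validity check on its slot.
setClass("BaseModel",
         representation(y = "numeric"),
         validity = function(object) {
           if (length(object@y) == 0L) "slot 'y' must not be empty" else TRUE
         })

# Hypothetical child class: inherits slot 'y' and the parent's validity check.
setClass("GaussModel", contains = "BaseModel",
         representation(sigma = "numeric"))

# Hypothetical generic that dispatches on both of its arguments.
setGeneric("combine", function(a, b, ...) standardGeneric("combine"))

setMethod("combine", signature("BaseModel", "BaseModel"),
          function(a, b, ...) "base + base")
setMethod("combine", signature("GaussModel", "BaseModel"),
          function(a, b, ...) "gauss + base")

b <- new("BaseModel", y = c(0.1, 0.2, 0.3))
g <- new("GaussModel", y = c(0.5, 1.5), sigma = 1)

combine(b, b)  # exact match on BaseModel/BaseModel
combine(g, b)  # exact match on GaussModel/BaseModel
combine(b, g)  # no exact match; the BaseModel/BaseModel method is inherited
```

Creating an object that violates the check, for example new("GaussModel", y = numeric(0), sigma = 1), should stop with the validity message, because new() validates the object when slots are supplied.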

EDIT: As @DWin and @g-grothendieck suggested, the above benchmarking doesn't consider the case where the slot of an existing object is altered. So here's another benchmark which is more relevant to the true application (the functions in the example could be get/set functions for some elements in the model, which are altered when maximizing the log-likelihood):

objS3<-structure(list(x=rep(1, 10^3), z=matrix(0,10,10), y=matrix(0,10,10)),
                 class="MyS3Class")
fnS3<-function(obj,a){
  obj$y<-a
  obj
}

setClass("MyClass", representation(x="numeric",z="matrix",y="matrix"))
objS4<-new("MyClass", x=rep(1, 10^3),z=matrix(0,10,10),y=matrix(0,10,10))
fnS4<-function(obj,a){ 
  obj@y<-a
  obj
}

a<-matrix(1:100,10,10)
microbenchmark(fnS3(objS3,a),fnS4(objS4,a))
Unit: microseconds
           expr    min     lq median     uq    max neval
 fnS3(objS3, a)  6.531  7.464  7.932  9.331 26.591   100
 fnS4(objS4, a) 21.459 22.393 23.325 23.792 73.708   100

The benchmarks are performed on R 2.15.2, on 64bit Windows 7. So here S4 is clearly slower.

Your benchmark concerns object creation. Is that really your concern? Does your application generate a large number of objects? — G. Grothendieck, Mar 23 '13 at 19:47
No, you are right, the original benchmarking wasn't really relevant here; I used it more as a comparison to the blog post I linked. I added another benchmark which should represent the real case more closely. — Jouni Helske, Mar 23 '13 at 19:56
This shows a ratio of median times of 0.3400643 in favor of S3. When I do this comparison in R 3.0.0 beta on Mac I get a ratio of the median times of 0.4841773. — IRTFM, Mar 23 '13 at 22:15

cbeleites unhappy with SX · Accepted Answer · 2013-03-23T22:21:43.600

First of all, you can easily have S3 methods for S4 classes:

> extract <- function (x, ...) x@x
> setGeneric ("extr4", def=function (x, ...){})
[1] "extr4"
> setMethod ("extr4", signature= "MyClass", definition=extract)
[1] "extr4"
> `[.MyClass` <- extract
> `[.MyS3Class` <- function (x, ...) x$x
> microbenchmark (objS3[], objS4 [], extr4 (objS4), extract (objS4))
Unit: nanoseconds
           expr   min      lq  median      uq   max neval
        objS3[]  6775  7264.5  7578.5  8312.0 39531   100
        objS4[]  5797  6705.5  7124.0  7404.0 13550   100
   extr4(objS4) 20534 21512.0 22106.0 22664.5 54268   100
 extract(objS4)   908  1188.0  1328.0  1467.0 11804   100

edit: due to Hadley's comment, change the experiment to plot:

> `plot.MyClass` <- extract
> `plot.MyS3Class` <- function (x, ...) x$x
> microbenchmark (plot (objS3), plot (objS4), extr4 (objS4), extract (objS4))
Unit: nanoseconds
           expr   min      lq median      uq     max neval
    plot(objS3) 28915 30172.0  30591 30975.5 1887824   100
    plot(objS4) 25353 26121.0  26471 26960.0  411508   100
   extr4(objS4) 20395 21372.5  22001 22385.5   31359   100
 extract(objS4)   979  1328.0   1398  1677.0    3982   100

for an S4 method for plot I get:

    plot(objS4) 19835 20428.5 21336.5 22175.0 58876   100

So yes, [ has an exceptionally fast dispatch mechanism (which is good, because I think extraction and the corresponding replacement functions are among the most frequently called methods). But no, S4 dispatch isn't slower than S3 dispatch.

Here the S3 method on the S4 object is as fast as the S3 method on the S3 object. However, calling without dispatch is still faster.

There are some things that work much better as S3, such as as.matrix or as.data.frame.
For some reason, defining these as S3 means that e.g. lm (formula, objS4) will work out of the box. This doesn't work with as.data.frame being defined as an S4 method.
Also, it is much more convenient to call debug on an S3 method.
Some other things will not work with S3, e.g. dispatching on the second argument.
Whether there will be any noticeable drop in performance obviously depends on your class, that is, what kind of structures you have, how large the objects are and how often methods are called. A few μs of method dispatch won't matter with a calculation of ms or even s. But μs do matter when a function is called billions of times.
One thing that caused a noticeable performance drop for some functions that are called often ([) is S4 validation (a fair number of checks done in validObject). However, I'm glad to have it, so I use it. Internally I use workhorse functions that skip this step.
In case you have large data and call-by-reference would help your performance, you may want to have a look at reference classes. I've never really worked with them so far, so I cannot comment on this.
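
The idea of keeping S4 validation for user-facing code while skipping it in internal workhorse functions might be sketched as below. This is only an illustration under assumptions: the class FittedModel and the functions setCoef and .setCoefUnchecked are hypothetical names, not the answerer's actual code. It relies on the fact that plain slot assignment with @<- checks only the class of the assigned value and does not run the validity function, whereas validObject() runs the full check.

```r
library(methods)

# Hypothetical class whose validity function rejects missing coefficients.
setClass("FittedModel",
         representation(coef = "numeric"),
         validity = function(object) {
           if (any(is.na(object@coef))) "coef must not contain NA" else TRUE
         })

# User-facing setter: runs the full validity check after the change.
setCoef <- function(obj, value) {
  obj@coef <- value
  validObject(obj)
  obj
}

# Internal workhorse: plain slot assignment, no validObject() call.
# Only the class of 'value' is checked against the slot definition.
.setCoefUnchecked <- function(obj, value) {
  obj@coef <- value
  obj
}

m <- new("FittedModel", coef = c(0.5, 1.5))
m <- .setCoefUnchecked(m, c(1, 2))  # fast path, for trusted internal code
m <- setCoef(m, c(3, 4))            # checked path, for user input
```

The trade-off is that the unchecked path can produce an invalid object if it is fed bad values, so it would normally stay unexported and be called only from code that has already validated its input.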

That benchmark may be misleading because `[` does internal (C-level) dispatch - I'm not sure if that makes a difference. — hadley, Mar 23 '13 at 21:47
@hadley: thanks. Yes, normal S3 dispatch is slower and maybe even a bit slower than S4 dispatch. — cbeleites unhappy with SX, Mar 23 '13 at 22:16
I thought the question related to `@<-` and `[<-`? But I may have been influenced in that understanding by Chambers discussion in the cited R-devel posting. — IRTFM, Mar 23 '13 at 22:23
@DWin: I thought that was just because of that post. But I may be wrong. In any case, I think one would need to benchmark them with realistic objects, because the size of the whole object and the size of the replaced slot and the size of the replacement values (if only part of the slot is exchanged) will matter. Thus the OP will need to do the benchmark himself. — cbeleites unhappy with SX, Mar 23 '13 at 22:55
Thanks, I think I'll have to think more carefully about my package design (never a bad thing...): which issues would really benefit from S4, and are they worth the potential performance drop? I now start to feel that there's no point in converting to S4, as I've managed so far without it (the package is almost ready). — Jouni Helske, Mar 24 '13 at 05:32

score 2 · Answer 2 · answered Mar 23 '13 at 21:44

If you are concerned about performance, benchmark it. If you really need multiple inheritance or multiple dispatch, use S4. Otherwise use S3.

hadley


I thought there was another recommendation for performance but the name of the package now escapes me ;-) – Dirk Eddelbuettel Mar 23 '13 at 21:55
Points should also be awarded on the droll-humor axis. – IRTFM Mar 23 '13 at 22:16
Well, the point of the question was to try to get some prior information so that I wouldn't just convert the package to S4, realize it is much slower and then dump it. But I now realize that might be the only way to know for sure. I have managed to circumvent the multiple inheritance with "manual" dispatching via conditional structures, so perhaps it's better to just stick to S3... – Jouni Helske Mar 24 '13 at 05:15

score 1 · Answer 3 · answered Mar 23 '13 at 19:37

(This is pretty close to the boundary of a "question likely to elicit opinion", but I believe it is an important issue, one for which you have offered code and data and useful citations, and so I hope there are no votes to close.)

I admit that I have never really understood the S4 model of programming. However, what Chambers' post was saying is that @<-, i.e. slot assignment, was being re-implemented as a primitive rather than as a closure so that it would not require a complete copy of an object when one component was altered. So the earlier state of affairs will be altered in R 3.0.0 beta. On my machine (a 5 year-old MacPro running R 3.0.0 beta) the relative difference was even greater. However, I did not think that was necessarily a good test, since it was not altering an existing copy of a named object with multiple slots.

res <-microbenchmark(structure(list(x=rep(1, 10^7)), class="MyS3Class"),
                new("MyClass", x=rep(1, 10^7)) )
summary(res)[ ,"median"]
#[1] 145.0541 103.4064
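
A rough way to check both points on a given installation (whether @<- is a primitive, and how long it takes to alter a slot of an existing named object with multiple slots) might look like the sketch below. The class name SlotDemo is hypothetical and only mirrors the slots used in the question's second benchmark; base system.time is used here instead of microbenchmark, so the timing is coarse and will vary by machine and R version.

```r
library(methods)

# Reports whether slot assignment is a primitive in the running R version.
is.primitive(`@<-`)

# Hypothetical class with several slots, mirroring the question's benchmark.
setClass("SlotDemo", representation(x = "numeric", z = "matrix", y = "matrix"))
obj <- new("SlotDemo", x = rep(1, 10^3),
           z = matrix(0, 10, 10), y = matrix(0, 10, 10))
a <- matrix(1:100, 10, 10)

# Repeatedly replace one slot of the existing, named object.
n <- 10000
elapsed <- system.time(for (i in seq_len(n)) obj@y <- a)[["elapsed"]]
elapsed / n  # rough average seconds per slot assignment
```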

I think you should go with S4 since your brain structure is more flexible than mine and there are a lot of very smart people, Douglas Bates and Martin Maechler to name two other than John Chambers, who have used S4 methods for packages that require heavy processing. The Matrix and lme4 packages both use S4 methods for critical functions.

Yes, I agree on both points: that this might not be a specific enough question, and that the above benchmarking might not be a good example in this case. I added another which should be more like the real case (for example in optimization). — Jouni Helske, Mar 23 '13 at 19:55
