Ensuring reproducibility in an R environment

Question

I work in a computational biology lab, where we have several folks working on multiple projects, mostly in R (which is what I care about for this post). In the past, people would simply develop their code for each project, which may or may not involve boilerplate code copied over from previous projects. One thing that I've pushed over the years was to bring some centralized structure to this mess and have people identify common patterns such that we can turn these repeated/common blocks of code into packages for all of the many reasons one might think that is a good thing to do. So now our folks are using a mix of centralized packages/routines within their project specific scripts.

There's one gotcha here. We have a mandate from the powers that be that every script for every project needs to be 100% reproducible over time to the best of our ability (and this includes 100% of all code we have direct access to, including our packages). That is, if I call function foo in package bar with parameter A to get result X today, 4 years from now I should get the exact same result. (Erroneous output due to bugs is excepted here.)

The topic of reproducibility has come up now and then in R within various circles, but typically it seems to be discussed in terms of reproducibility of process (e.g. vignettes). This is not the same thing - I can run a vignette today and then run the same code 6 months from now using updated packages and receive wildly different results.

The solution that's been agreed upon (which I'm not a fan of) is that if a function or package needs a non-backwards-compatible change, it simply gets a new name. Thus, if we needed to radically change function foo(), it'd be called foo2(), and if that needs a radical change it gets called foo3(). This ensures that any script that called foo() will always get the original result, while allowing things to march forward within the package repository. It works, but I really dislike this - it seems aesthetically extremely cluttered, and I worry that it will lead to mass confusion over time, with packages bar, bar2, bar3, bar4 ... and functions foo1, foo2, foo3, etc.

The problem is that I haven't come up with an alternate solution that's really better. One possibility would be to note version numbers of packages, R, etc and make sure those are loaded, but that has multiple problems - not the least of which is that it relies on proper package versioning discipline and that's prone to error. Also, this alternative was already rejected ;) Ideally what we'd have is some sort of notion of devel & release as most of these changes tend to happen earlier on and then level off with changes happening much less frequently. OTOH what devel really means here is "not actually in a package yet" (which we do), but it can be hard to determine exactly at what point is the right one to transport stuff over. Invariably the moment you think you're safe, that's when you realize you're not.

So with all this in mind, I'm curious if anyone else out there has dealt with similar situations, and how they might have resolved things.

edit: just to be clear, by non-backwards compatible, I'm not just talking about APIs and such, but also outputs for a given set of inputs.

Good question, I wish I could offer a better answer but I am in a fairly similar boat myself and we are debating how to move forward on this same issue. — Stedy, Nov 04 '10 at 00:13
I assume you would have a good motivation to radically change the output of the function. I also assume that the changed version would radically improve the old version. Therefore, why would you want to replicate a result that you can re-obtain in a more correct way with the new version? Methods get better, so does the quality of the results. Of course, if your functions are really radically changed they definitely deserve a new name. I don't see any other solution which is more elegant than that. — nico, Nov 04 '10 at 00:34
Too bad package version was already thrown out. I would have opted for that. In your report you would write that the analysis was done using R version x.xx and package hurah_1.0. Appropriate behaviour of functions would be available in vignettes or documentation. Period. If any one wants to reproduce exact results (using the primal datasets), (s)he should check her/his results with newer versions of R/custom package against the original. Alternatively, you could employ a person that would be checking and assuring backward compatibility 24/7. I can be reached at... :) — Roman Luštrik, Nov 04 '10 at 07:35
Why would you get "wildly different" results when running with updated packages? This is fishy because a package is meant to correctly perform a certain (deterministic) calculation. With "wildly different" results, how do you trust ANY of your calculations? — zvrba, Nov 04 '10 at 07:49
@zvrba - I could be wrong, but I think the OP meant the packages developed in-house. Thus, he is concerned that results may vary over time as they improve/expand their packages, not the CRAN packages. But, I could be wrong. But, I do agree that "wildly different" is either hyperbolic or they have a serious quality control problem. — Choens, Nov 04 '10 at 13:03
zvrba/choens: Yes, wildly different is generally hyperbole but not necessarily. Look at Biobase where over the last 9 years they've migrated from 'exprSet' to 'ExpressionSet' to 'AnnotatedDataFrame'. Things do change over time in a non-backwards compatible fashion. For us, it usually involves changes to our methods and/or having developed a deeper understanding of the problem - at the end of the day it's a research environment & not a software development company — geoffjentry, Nov 04 '10 at 21:29
@geoffjentry: sure, but exactly because it's a research environment it makes no sense at all to ensure old results can get exactly reproduced when new methods are available. — nico, Nov 05 '10 at 06:42
Nico: You need to be able to reproduce things that you've published. We might have moved on, but a paper from 5 years ago remains static. — geoffjentry, Nov 05 '10 at 20:15
@geoffjentry consider accepting one of the solutions, do not wait any longer :) — jangorecki, Mar 30 '16 at 11:19

score 20 · Answer 1 · answered Nov 04 '10 at 01:50

This is indeed an important thing to think about and I think ultimately requires the institutionalization of a couple of different processes.

Version Control (svn, git, bzr, cvs, etc)
Unit Tests

My first reaction is that you need to institutionalize some sort of code management system. This will make it easier, because the old version of foo() is still available, if you really want it. From what you have said, it sounds like you need to package up your common functions and institute some sort of a release schedule. Scripts which require backward compatibility must include the package name and release information. This way it is possible to ALWAYS obtain foo() exactly as it was when the script was written. You should also make sure people only use official release versions in their work, because otherwise this could become quite a pain.
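As a rough sketch of what "include the package name and release information" could look like at the top of a script: the helper below is hypothetical (`require_exact_version` is not part of any package) and only relies on `utils::packageVersion()`. It pins against the `stats` package purely because that ships with R.

```r
# Hypothetical helper: stop early if an installed package is not the exact
# release the script was written against.
require_exact_version <- function(pkg, version) {
  installed <- utils::packageVersion(pkg)
  if (installed != package_version(version)) {
    stop(sprintf("%s %s is installed, but this script was written for %s",
                 pkg, as.character(installed), version), call. = FALSE)
  }
  invisible(TRUE)
}

# Example with a package that ships with R, pinned to whatever is installed.
current <- as.character(utils::packageVersion("stats"))
require_exact_version("stats", current)
```

In a real script the version string would be written out literally, so a later upgrade makes the script fail loudly instead of silently producing different numbers.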

I agree, having a collection of foo:foo99 is doomed to failure. But at least it will be a gloriously confusing failure. Aesthetics aside, it will drive you all bonkers. If foo2() is an improvement (more accurate, faster, etc) of foo(), then it should be called foo() and released for use according to your company-wide release schedule. If it does something different, it is no longer foo(). It might be fooo() or superFoo() or fooMe(), but it ain't foo().

Finally, you need to start testing your functions. (Unit Tests) For each function that is published and made available for others, you should have a clearly defined test suite. Unless someone fixes a bug in foo(), the results should stay the same. If someone fixes a bug, then the results should be more accurate and will probably be more desirable in most cases. If you do need to reproduce the old, incorrect results, you can dig out an old version of foo() from your version control system. By instituting rigorous unit tests, you will know if/when the results of foo() have changed. This knowledge should help minimize the number of foo() functions you need. Rather than create a version every time someone tweaks something, you can test the new version to see whether or not the results conform to expectations. But this is tricky, because you have to make sure that your tests cover anything the function is ever likely to see, including bizarre edge cases. In a research setting, I would imagine that could become a challenge.
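One way to picture such a test is a regression check against a stored reference result. This is only a sketch in base R: the function `foo`, the input vector and the file name are all made up for illustration, and a real suite would more likely live in a testing framework.

```r
# Hypothetical function under test.
foo <- function(x) cumsum(x) / seq_along(x)

# One-off step when the behaviour is frozen: save the reference output.
ref_file <- file.path(tempdir(), "foo_reference.rds")
input <- c(2, 4, 6, 8)
saveRDS(foo(input), ref_file)

# The regression check, run on every later change to foo().
check_foo <- function() {
  reference <- readRDS(ref_file)
  isTRUE(all.equal(foo(input), reference))
}
check_foo()
```

`all.equal()` compares with a numeric tolerance; if bit-for-bit equality is what the mandate requires, `identical()` would be the stricter choice.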

We actually do use SVN and do maintain unit tests. That's not really the problem. Your basic suggestion is really what I think is the ideal case, although I suspect it won't really fly - although the fact that several people have said the same thing here might hopefully give me some extra firepower :) — geoffjentry, Nov 04 '10 at 21:35
Interesting. If you are already using SVN and unit tests, it sounds like your problems aren't technological. Your problems are institutional/managerial. Clearly there is some resistance to this from someone(s) in your company. Is this resistance coming from management or the scientists/analysts? — Choens, Nov 04 '10 at 22:36
It is definitely not technological. I'd say more "philosophical" than anything. My main hope here was to see if there were possibilities that I hadn't considered that might be mutually acceptable to all parties. On the upside, there wasn't much out there that I hadn't thought of already. On the downside, there wasn't much out there that I hadn't thought of already. — geoffjentry, Nov 05 '10 at 18:03

score 8 · Answer 2 · answered Nov 04 '10 at 00:40

I'm not sure about integrating it with R, but Sumatra might be worth looking into. It appears to allow you to keep track of code and results. So if you need to go back and re-run that simulation from 4 years ago, the code should be there.

kmm

Joris Meys · Answer 3 · 2010-11-04T09:52:08.527

5

Well, ask yourself how you would do that in any other language. There's really nothing more to it than good bookkeeping I'm afraid:

record version numbers of all software involved
put the code in manageable chunks, say in packages.
make sure you have all software/packages involved still available in 5 years.
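For the first item, a minimal sketch of recording the versions next to the results, using only `utils::sessionInfo()`. The helper name `record_environment` and the output file name are assumptions for illustration.

```r
# Write the R version and the loaded package versions to a text file.
record_environment <- function(path) {
  info <- utils::sessionInfo()
  writeLines(utils::capture.output(print(info)), path)
  invisible(info)
}

log_file <- file.path(tempdir(), "session_info.txt")
info <- record_environment(log_file)
info$R.version$version.string
```

Storing this file with each analysis gives you the bookkeeping; it does not by itself guarantee that those versions can still be installed later.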

R can easily be made portable, including all installed packages. Keep a portable version of R together with the used packages, the code and the data on a CD-ROM for each analysis, and you're sure you can reproduce whenever you want. OK, you miss the OS, but can't have them all. In any case, if the OS makes a difference important enough to call the analysis not reproducible, the problem is very likely your analysis. You don't want to tell anybody your result is dependent on the version of Windows you use, do you?

PS: please get it into people's heads that they should never ever in their life copy-paste code. They should wrap it in functions and use those. A whole lot easier and far less error-prone. I mean, what's the difference between copying

x <- read.table("sometable")
y <- colSums(x)/4.3

and adjusting the values, or typing

myfun <- function(i,j){
  x <- read.table(i)
  # return the scaled column sums
  colSums(x)/j
}

Saves you and a lot of other people a whole lot of copy-paste trouble. (How so, object not found? What object?)

edited Nov 04 '10 at 09:52

answered Nov 04 '10 at 09:45

On your latter point, this was the point of this whole endeavor in the first place ;) – geoffjentry Nov 04 '10 at 21:35
@geoffjentry : ah true, kind of redundant information I guess. main point was the portable analysis CD-ROMs though. – Joris Meys Nov 04 '10 at 22:00
Yeah - this and the sumatra suggestion led to a discussion along these lines today about bundling the entire environment up with a project. The CD (or even DVD) wouldn't work for us due to the size of the data. The only real issue about having self contained environments though would be common annotations, data, etc (e.g. information about the genome) as this can get pretty bulky. It's a thought, that's for sure. – geoffjentry Nov 04 '10 at 22:59
@geoffjentry: I reckon the data is then kept somewhere in a database system on a server or so? At my last job, we had a backup server exactly for that purpose: keeping snapshots of the database. Don't know how it's arranged at your job, but storage capacity is pretty cheap these days, so it could be kept on a server with a few terabytes installed in it? – Joris Meys Nov 05 '10 at 12:22
Our data storage is currently around 30TB and growing at an ever increasing rate. Just getting backup on that amount is a challenge :) How it is stored depends exactly on the 'what' for different data though. – geoffjentry Nov 05 '10 at 18:02
@geoffjentry: uff... but that also sheds a different light on the reproducibility. I don't know if sequences/annotations get updated often, but if you can't guarantee the data to be the same, there's little use in trying to get the same result. Which doesn't mean you can't keep the analysis for future use though, if that's the reproducibility you look for. But that won't suffice as a criterion for published results I'm afraid... – Joris Meys Nov 06 '10 at 02:39
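On the point about guaranteeing that the data are the same: one cheap check, sketched here under the assumption that the inputs are files, is to record a checksum with `tools::md5sum()` when the analysis is frozen and compare it before any re-run. The file name and the helper `data_unchanged` are hypothetical.

```r
# Hypothetical input file standing in for a real data set.
data_file <- file.path(tempdir(), "expression_matrix.csv")
write.csv(data.frame(gene = c("a", "b"), value = c(1.5, 2.5)),
          data_file, row.names = FALSE)

# Record the checksum when the analysis is frozen.
recorded <- unname(tools::md5sum(data_file))

# Later: check that the file still has the recorded checksum.
data_unchanged <- function(path, expected) {
  identical(unname(tools::md5sum(path)), expected)
}
data_unchanged(data_file, recorded)
```

This only detects that something changed; keeping the old data around is still a storage problem.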

score 5 · Answer 4 · answered Jun 23 '11 at 12:25

Whenever you want to freeze your code in a way that needs to be reproducible "forever", e.g., when your paper has been published, the safest way to do this is to create a virtual machine containing all your code and data and the software needed to run it (including the operating system). There's an example here on the University of Washington site.

Richie Cotton

excellent point... there is another very nice example from the PEcAn project here: https://ebi-forecast.igb.illinois.edu/redmine/projects/pecan-1-2-5. – David LeBauer Sep 21 '12 at 22:28

score 3 · Answer 5 · answered Nov 04 '10 at 00:59

This is exactly the kind of thinking that causes Microsoft to maintain bug compatibility in Excel. Rather than attempting to conform to such a request you should be doing your best to show that it's not a good idea.

This thinking means that all errors remain errors in order to maintain consistency. It's thinking transferred from corporate bureaucracy and has no business in a science lab.

The only way to do this is to save the copy of all your packages and version of R with your code. There's no central corporation beholden to bug compatibility that's going to take care of that for you.
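A minimal sketch of keeping packages with the code, assuming a project-local library folder (the path below is made up, and a temporary directory is used only so the example runs anywhere). It uses `.libPaths()`, which controls where `library()` looks and where `install.packages()` installs by default.

```r
# Hypothetical project layout: a "library" folder kept next to the scripts.
project_lib <- file.path(tempdir(), "my_project", "library")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Put the project library first on the search path, so library() and
# install.packages() use it before any site-wide library.
.libPaths(c(project_lib, .libPaths()))
.libPaths()[1]
```

Archiving that folder together with the R installer for the version used covers most of what this answer asks for, short of the operating system.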

John

@John: actually, this is science bureaucracy. If you publish a paper, you have to make sure you can show and rerun the analysis anytime when asked. Quite shameful if you then have to admit the results suddenly turn out to be different... – Joris Meys Nov 04 '10 at 10:13
It's not shameful. It's an error, and to be expected. Furthermore, if you merely maintain your original code then changes would be in add on packages or base R... making the responsibility more diffuse. – John Nov 04 '10 at 14:13
What Joris said. Changes are not necessarily errors. If we publish on a result, we're expected to be able to reproduce those same results - if there are real errors obviously those need to be fixed. – geoffjentry Nov 04 '10 at 21:34
OK, I can agree that changes aren't necessarily errors. Packages that don't have 'true' results (like lme4) may change without error. But no one's discriminating errors from changes here. My original assertion stands in spite of that. This kind of thinking will cause people to consider ensuring bug compatibility. Just the reason cited here, embarrassment, will be enough. Sure, real errors obviously need to be fixed, but that's not mentioned. I've seen these kinds of policies and they generally undercut drives to fix real errors. Replicability is all. – John Nov 05 '10 at 16:40
I'm reminded that method sections in research papers should enable one to exactly replicate the study. They almost never do. Is this a bad thing or a good thing? Most everyone can have a knee-jerk thought about which answer is correct, but it's actually a deep problem. Consider the social and motivational aspects of scientists. It's a bit much to go into in comments I guess... maybe I'll write a substantially more detailed answer later. – John Nov 05 '10 at 16:51
Agreed that the methods almost never are. It's actually somewhat of an irritant at times. That doesn't mean that one shouldn't strive for that though. I don't make those calls though :) – geoffjentry Nov 05 '10 at 18:01

score 3 · Answer 6 · answered Nov 04 '10 at 07:44

What if a change in result is due to a change in your operating system? Perhaps Microsoft fix a bug in Windows XP for Windows 7 and then when you upgrade - all your outputs are different.

If you want to handle this then I think the best way of working is to keep snapshots of virtual machines when you close out an analysis, and store the VM images for later use. Of course in five years time you won't have a license to run Windows XP so that's another problem - one solved by using an open-source operating system, such as Linux.

score 2 · Answer 7 · answered Dec 19 '15 at 00:46

I would go with Docker images.
This is a pretty convenient way to reproduce the OS and all dependencies.
You build an image and can later deploy it to Docker at any time; it will be fully configured.
There are multiple R Docker images available, so you can easily build your image upon them.
Once the image is built, you can use it to deploy to a test environment and later to production.

score 1 · Answer 8 · edited May 23 '17 at 10:29

A solution might be to use S4 methods and letting R's internal dispatcher do the work for you (see example below). That way, you're somewhat "bulletproof" with respect to being able to systematically update your code without running the risk of breaking something.

Key benefits

The key thing here is that S4 methods support multiple dispatch.

That way your function will always be foo (as opposed to having to keep track of foo1, foo2 etc.) while new functionality can be easily implemented (by adding respective methods) without touching "old" methods (that other people/packages might rely on).

Key functions you'll need:

setGeneric
setMethod
setRefClass (S4 Reference Classes; personal recommendation) or setClass (S4 Class; I wouldn't use them for the reason described in the "Additional remarks" at the very end)

The "downsides"

You need to switch from an S3 to an S4 logic
This implies that you need to write a bit more code than what you might be used to (generic method definitions, method definitions and possibly your own class definitions; see example below). But this "buys" yourself and your code much more structure and makes it more robust.
It might also imply that you'll eventually dig deeper and deeper into the world of Object-Oriented Programming or Object-Oriented Design. While I personally consider this to be a good thing (my personal rule of thumb: the more complex/distributed your application, the better off you are using OOP), some would consider these approaches to be atypical for R (I strongly disagree as R does have superb OO-features that are maintained by the Core Team) or "unsuited" for R (this might be true depending on how much you rely on "non-OOP" packages/code). If you're willing to go that way, you might want to familiarize yourself with the SOLID principles of Object-Oriented Design. You also might want to check out the following books: Clean Coder and The Pragmatic Programmer.
If computational efficiency (e.g. when estimating statistical models) is really critical, using S4 methods and S4 Reference Classes might slow you down a bit. After all, there's more code involved compared to S3. But I'd recommend testing the impact of this from case to case via system.time() and/or microbenchmark::microbenchmark() instead of picking "ideological" sides (S3 vs. S4).
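A rough sketch of such a case-by-case timing check with `system.time()`. The toy generics `area` and `areaS4` and the class names are invented for illustration; the numbers will vary by machine and say nothing general about S3 vs. S4.

```r
library(methods)

# S3 version of a toy generic.
area <- function(shape) UseMethod("area")
area.square <- function(shape) shape$side^2

# S4 version of the same toy generic.
setClass("Square", representation(side = "numeric"))
setGeneric("areaS4", function(shape) standardGeneric("areaS4"))
setMethod("areaS4", "Square", function(shape) shape@side^2)

s3_obj <- structure(list(side = 3), class = "square")
s4_obj <- new("Square", side = 3)

# Time many dispatches of each variant.
n <- 10000
t_s3 <- system.time(for (i in seq_len(n)) area(s3_obj))
t_s4 <- system.time(for (i in seq_len(n)) areaS4(s4_obj))
rbind(S3 = t_s3, S4 = t_s4)[, "elapsed"]
```

If the dispatch overhead is tiny compared with the actual computation inside the method, the choice can be made on design grounds alone.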

Example

Initial function

Let's suppose you're in department A and someone in your team started out with creating a function called foo()

foo <- function(x, y) {
    x + y
}
foo(x=10, y=20)

First change request

You would like to be able to extend it without breaking "old" code that relies on foo().

Now, I think we all agree that this can be quite hard to do.

You either need to explicitly modify the source code of foo() (each time running the risk that you break something that already used to work; this violates the "O" in SOLID: the Open-Closed Principle) or you need to come up with alternative names such as foo1, foo2 etc. (really hard to keep track of which function is doing what).

foo <- function(x, y, type=c("old", "new")) {
    type <- match.arg(type, choices=c("old", "new")) 
    if (type == "old") {
        x + y
    } else if (type == "new") {
        x * y    
    }
}
foo(x=10, y=20)
[1] 30
foo(x=10, y=20, type="new")
[1] 200

foo1 <- function(x, y) {
    x * y
}
foo1(x=10, y=20)
[1] 200

Let's see how S4 methods and multiple dispatch can really help us out here.

Generic method

You need to start out by turning foo() into a generic method.

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

In simplified words: a generic method itself doesn't do anything yet. It's simply a precondition in order to be able to specify "actual" methods for its signature arguments that do something useful.

Signature arguments

The degree of flexibility with respect to the original problem is directly linked to the number of signature arguments that you declare (signature=c("x", "y", ".ctx", ".ns")): the more signature arguments, the more flexibility you have but the more complex your code might get as well (with respect to how much code you have to write).

Again, in simplified words: signature arguments (and their classes) are used by the method dispatcher to retrieve the correct method that's doing the actual work.

Think of the method dispatcher as being like the clerk in a ski rental business: you present him with an arbitrarily large set of signature information (i.e. information that "clearly distinguishes you from others": your age, height, shoe size and skill level) and he uses that information to provide you with the right equipment to hit the slopes. Think of R's method dispatcher as being the clerk that has access to the storage room of the ski rental. But instead of ski equipment it will return methods.

Notice that we said that our "old" arguments x and y are from now on supposed to be signature arguments while there are also two new arguments: .ctx and .ns. I'll get to these in a minute. It's those arguments that will provide us with the flexibility that we're after.

Initial method definition

We now define a "variant" (a method) of the generic method for the following "signature scenario":

x is numeric
y is numeric
.ctx will just not be provided when calling the method and is thus missing
.ns will just not be provided when calling the method and is thus missing

Think of it as registering your signature information with explicit equipment of the ski rental. Once you did that and ask for your equipment, the only thing the clerk has to do is to go to the storage room and look up which equipment is linked to your personal information.

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        x + y
    }
)

When we call foo with this "signature scenario" (asking for the method that we registered for this scenario), the method dispatcher knows exactly which actual method it needs to get out of the storage room:

foo(x=10, y=20)
[1] 30

First update

Now someone from department B comes along, looks at foo(), likes it but decides that foo() needs to be updated (x * y instead of x + y) if it is to be used in his department.

That's when .ctx (short for context) comes into play: it's an argument by which we are able to distinguish application contexts.

Defining a class that represents the new application context

setRefClass("ApplicationContextDepartmentB")

When calling foo(), we'll provide it with an instance of this class (.ctx=new("ApplicationContextDepartmentB"))

Defining a new method for the new application context

Notice how we register signature argument .ctx to our new class ApplicationContextDepartmentB:

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", 
        .ctx="ApplicationContextDepartmentB", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        out <- x * y
        attributes(out)$description <- "I'm different from the original foo()"
        return(out)
    }
)

That way, the method dispatcher knows exactly that it should return the "new" method instead of the "old" method when we call foo() like this:

foo(x=1, y=10, .ctx=new("ApplicationContextDepartmentB"))
[1] 10
attr(,"description")
[1] "I'm different from the original foo()"

The "old" method is not affected at all:

foo(x=1, y=10)
[1] 11

Second update

Suppose that someone from department C comes along and suggests yet another "configuration" or version for foo(). You can easily provide that without breaking anything that you've realized for departments A and B so far by following the same routine as for department B.

But we'll even take it one step further here: we'll define two additional classes that let us distinguish different "namespaces" (that's where .ns comes into play).

Think of namespaces as a way of distinguishing different runtime scenarios for a specific method for a specific application context (e.g. "testing" and "productive mode").

Defining the classes

setRefClass("ApplicationContextDepartmentC")
setRefClass("TestNamespace")
setRefClass("ProductionNamespace")

Defining a new method for the new application context and a "test" scenario

Notice how we register signature arguments .ctx to our new class ApplicationContextDepartmentC and .ns to our new class TestNamespace:

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="TestNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y, test.ok=rep(TRUE, length(x)))
    }
)

Again, the method dispatcher will look up the correct method when calling foo() like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("TestNamespace"))
  x  y test.ok
1 a 11    TRUE
2 b 12    TRUE
3 c 13    TRUE
4 d 14    TRUE
5 e 15    TRUE

Defining a new method for the new application context and a "productive" scenario

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)

We tell the method dispatcher that we now want the method registered for this scenario or namespace like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("ProductionNamespace"))

  x  y
1 a 11
2 b 12
3 c 13
4 d 14
5 e 15

Notice that you're free to use the classes TestNamespace and ProductionNamespace anywhere you'd like. These classes are not bound to ApplicationContextDepartmentC in any way, so you can for example also use them for all your other application scenarios.

Additional remarks for method definitions

Something that's often quite useful is to start out with a method that accepts ANY classes for its signature arguments and define more restrictive methods as your software evolves:

setMethod(
    f="foo", 
    signature=signature(x="ANY", y="ANY", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        message("Value of x:")
        print(x)
        message("Value of y:")
        print(y)
    }
)
foo(x="Hello World!", y=rep(TRUE, 3))
Value of x:
[1] "Hello World!"
Value of y:
[1] TRUE TRUE TRUE

Additional remarks for class definitions

I prefer S4 Reference Classes over S4 Classes because of the self-referencing capabilities of S4 Reference Classes:

setRefClass(
    Class="A", 
    fields=list(
        x1="numeric",
        x2="logical"
    ),
    methods=list(
        getX1=function() {
            .self$x1
        },
        getX2=function() {
            .self$x2
        },
        setX1=function(x) {
            .self$x1 <- x
        },
        setX2=function(x) {
            .self$field("x2", x)
        },
        addX1AndX2=function() {
            .self$getX1() + .self$getX2()
        }
    )
)
x <- new("A", x1=10, x2=TRUE)
x$getX1()
[1] 10
x$getX2()
[1] TRUE
x$addX1AndX2()
[1] 11

S4 Classes don't have that feature.

Subsequent modifications of field values:

x$setX1(100)
x$addX1AndX2()
[1] 101
x$x1 <- 1000
x$addX1AndX2()
[1] 1001

Additional remarks for documenting methods and classes

I strongly recommend using packages roxygen2 and devtools to document your methods and classes. You possibly might also want to look into package roxygen3.

Documenting generic methods with roxygen2:

#' Foo
#'
#' This method takes \code{x} and \code{y} and adds them.
#' 
#' Some details here
#' 
#' @param x \strong{Signature argument}.
#' @param y \strong{Signature argument}.
#' @param ... Further arguments to be passed to subsequent functions.
#' @param .ctx \strong{Signature argument}.
#'      Application context.
#' @param .ns \strong{Signature argument}.
#'      Application namespace. Usually used to distinguish different context 
#'      versions or configurations.
#' @author Janko Thyson \email{john.doe@@something.com}
#' @references \url{http://www.something.com/}
#' @example inst/examples/foo.R
#' @docType methods
#' @rdname foo-methods
#' @export

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

Documenting methods with roxygen2:

#' @param x \code{\link{character}}. Character vector.
#' @param y \code{\link{numeric}}. Numerical vector.  
#' @param .ctx \code{\link{ApplicationContextDepartmentC}}. 
#' @param .ns \code{\link{ProductionNamespace}}.  
#' @return \code{\link{data.frame}}. Some data frame.
#' @rdname foo-methods
#' @aliases foo,character,numeric,ApplicationContextDepartmentC,ProductionNamespace-method
#' @export

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)

score 1 · Answer 9 · answered Aug 01 '11 at 23:30

This may be a late answer, but I have found it useful to create a generic wrapper like the following, especially when iterating quickly in my development of a new function:

myFunction <- function(..., version = "latest"){
  if((version == "latest") || (version == 6)){
    return(myFunction06(...))
  }
  # ... (intermediate versions handled the same way)
  if((version == 1)){
    return(myFunction01(...))
  }
}

Then, code should simply state which version it wants. Once the actual function stabilizes, I remove support for the older versions of the function, and a quick search through my code lets me find any offending calls. Use of "latest" means I can assure that the caller and the function match some fairly fixed definitions.
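A runnable variant of the same idea, as a sketch only: it swaps the chain of if statements for a lookup table, which is my substitution and not the code above. The two frozen implementations and their behaviour are made up for illustration.

```r
# Hypothetical frozen implementations; the names are for illustration only.
myFunction01 <- function(x) x + 1
myFunction02 <- function(x) x + 2

# Lookup table from version label to implementation, oldest first.
implementations <- list("1" = myFunction01, "2" = myFunction02)

myFunction <- function(..., version = "latest") {
  key <- if (identical(version, "latest")) {
    # "latest" resolves to the last entry in the table.
    names(implementations)[length(implementations)]
  } else {
    as.character(version)
  }
  impl <- implementations[[key]]
  if (is.null(impl)) stop("Unknown version: ", key, call. = FALSE)
  impl(...)
}

myFunction(10, version = 1)
myFunction(10)
```

Removing support for an old version then means deleting one table entry, and any script still asking for it fails with an explicit error.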

Naturally, all code is maintained in a version control system, so even when I remove the earlier code, it is only from the currently available source. I can reproduce any behavior from any point in time, including errors, as long as the data from that point in time is obtainable.