
Background

The dispatch mechanism of the R functions rbind() and cbind() is non-standard. I explored some possibilities of writing rbind.myclass() or cbind.myclass() functions when one of the arguments is a data.frame, but so far I do not have a satisfactory approach. This post concentrates on rbind, but the same holds for cbind.

Problem

Let us create an rbind.myclass() function that simply echoes when it has been called.

rbind.myclass <- function(...) "hello from rbind.myclass"

We create an object of class myclass, and the following calls to rbind() all dispatch properly to rbind.myclass():

a <- "abc"
class(a) <- "myclass"
rbind(a, a)
rbind(a, "d")
rbind(a, 1)
rbind(a, list())
rbind(a, matrix())

However, when one of the arguments is a data frame (this need not be the first one), rbind() will call base::rbind.data.frame() instead:

rbind(a, data.frame())

This behavior is a little surprising, but it is actually documented in the dispatch section of rbind(). The advice given there is:

If you want to combine other objects with data frames, it may be necessary to coerce them to data frames first.

In practice, this advice may be difficult to implement. Conversion to a data frame may remove essential class information. Moreover, a user who is unaware of the advice may be stuck with an error or an unexpected result after issuing the command rbind(a, x).
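The contrast described in this section can be reproduced with a short script. This is only a sketch that reuses the names from the examples above; tryCatch() is added as a precaution because I make no claim about what the non-myclass code path returns, only that it is not the echo from rbind.myclass().

```r
# Echo method from the question
rbind.myclass <- function(...) "hello from rbind.myclass"
a <- structure("abc", class = "myclass")

# No data frame among the arguments: the S3 method is found
res_plain <- rbind(a, "d")

# A data frame among the arguments: rbind.myclass() is no longer reached.
# tryCatch() captures an error, if any, from whichever code runs instead.
res_df <- tryCatch(rbind(a, data.frame()), error = function(e) e)

# Compare both results with the echo string
identical(res_plain, "hello from rbind.myclass")
identical(res_df, "hello from rbind.myclass")
```

The two identical() calls should disagree, which is the symptom the rest of the question tries to work around.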

Approaches

Warn the user

A first possibility is to warn the user that the call to rbind(a, x) should not be made when x is a data frame. Instead, the user of package mypackage should make an explicit call to a hidden function:

mypackage:::rbind.myclass(a, x)

This can be done, but the user has to remember to make the explicit call when needed. Calling the hidden function is something of a last resort, and should not be regular policy.

Intercept rbind

Alternatively, I tried to shield the user by intercepting dispatch. My first try was to provide a local definition of base::rbind.data.frame():

rbind.data.frame <- function(...) "hello from my rbind.data.frame"
rbind(a, data.frame())
rm(rbind.data.frame)

This fails because rbind() is not fooled into calling rbind.data.frame() from the .GlobalEnv, and calls the base version as usual.

Another strategy is to override rbind() by a local function, which was suggested in S3 dispatching of `rbind` and `cbind`.

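rbind <- function (...) {
  # inherits() also copes with unclassed objects and multi-element class vectors,
  # where comparing attr(x, "class") with == would fail inside if()
  args <- list(...)
  if (length(args) > 0 && inherits(args[[1]], "myclass")) return(rbind.myclass(...))
  else return(base::rbind(...))
}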

This works perfectly for dispatching to rbind.myclass(), so the user can now type rbind(a, x) for any type of object x.

rbind(a, data.frame())

The downside is that after library(mypackage) we get the message "The following objects are masked from 'package:base': rbind".

While technically everything works as expected, there should be better ways than a base function override.

Conclusion

None of the above alternatives is satisfactory. I have read about alternatives using S4 dispatch, but so far I have not located any implementations of the idea. Any help or pointers?

Stef van Buuren
  • AFAIK, smarter people than me such as Matt Dowle and Hadley Wickham so far haven't been able to solve this in an elegant way. You could study their workarounds in data.table or tidyverse (maybe dplyr?) package sources. – Roland Dec 25 '17 at 09:21
  • Thanks. This is helpful. I looked into what's done in `data.table`. When the user issues `library(data.table)`, the package redefines `base::rbind.data.frame` to dispatch to an internal method. It works for `data.table`, but I fear that doing this simultaneously in multiple packages is asking for trouble. In FAQ 2.23 Matt says: "If there is a better solution we will gladly change it." – Stef van Buuren Dec 25 '17 at 14:24
  • I fear we hit the wall. Hadley calls this "unfixable" (https://github.com/tidyverse/dplyr/issues/606#issuecomment-56529411), and seems to have given up after a lot of trying. So no solution there either... – Stef van Buuren Dec 25 '17 at 14:44
  • If you have an actual use case, I'd suggest raising this on the R-devel mailing list. In my opinion this is a design flaw that should be fixed in R. No warranty that this would get fixed, though. Especially if you don't supply a patch. – Roland Dec 25 '17 at 15:43
  • This is an option, but I'm sure there are reasons that R works in the way it works. I raised this question in the hope of finding an alternative. – Stef van Buuren Dec 26 '17 at 06:10
  • What was a good reason more than a decade ago isn't necessarily a good reason today. I suspect rbind's dispatch is a legacy from very early R versions. It was probably implemented to speed up S3 dispatch for rbind. However, one should be able to dispatch a non-data.frame method for data.frames that have another class as the first class. I suspect that such a scenario was not considered back then. – Roland Dec 26 '17 at 09:51

3 Answers


As you mention yourself, using S4 would be one good solution that works nicely. I have not investigated this recently with data frames, as I am much more interested in other generalized matrices, in both of my long-time CRAN packages 'Matrix' ("recommended", i.e. part of every R distribution) and 'Rmpfr'.

There are actually two different ways:

1) Rmpfr uses the new way to define methods for the '...' in rbind()/cbind(). This is well documented in ?dotsMethods (mnemonic: '...' = dots) and implemented in Rmpfr/R/array.R, line 511 ff (e.g. https://r-forge.r-project.org/scm/viewvc.php/pkg/R/array.R?view=annotate&root=rmpfr).

2) Matrix uses the older approach of defining (S4) methods for rbind2() and cbind2(). ?rbind mentions this and explains when rbind2/cbind2 are used. The idea there: "2" means you define S4 methods with a signature for two ("2") matrix-like objects, and rbind/cbind uses them recursively for two of its potentially many arguments at a time.
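To make option 2 concrete, here is a minimal sketch of an rbind2() method for an S3 class. It assumes the class is registered with setOldClass(); the class name myclass is taken from the question and the method body is a hypothetical placeholder. It only demonstrates a direct call to rbind2(). Whether a plain rbind() call reaches such a method depends on the dispatch rules described in ?rbind, and this sketch makes no claim about that.

```r
library(methods)

# Register the S3 class so that it can appear in an S4 signature
setOldClass("myclass")

# Hypothetical placeholder method: it only reports that it was called
setMethod("rbind2", signature(x = "myclass", y = "ANY"),
          function(x, y, ...) "hello from rbind2 for myclass")

a <- structure("abc", class = "myclass")

# Direct call to the two-argument generic, with a data frame as second argument
res <- rbind2(a, data.frame())
res
```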

Martin Mächler
  • Thanks. These are useful suggestions. I will study these options, and try to develop example code. – Stef van Buuren Dec 28 '17 at 06:34
  • If you take the cbind2/rbind2 approach (but maybe also for the '...'), you'd have to use `setOldClass("mids")` [or a version with more arguments, e.g. `S4class = .`]; see also `?Methods_for_S3`, which is relatively recent (2016) BTW. The nice "technology" may not have been used extensively yet, so we (me, JMC, ...) may be interested to help. – Martin Mächler Dec 29 '17 at 09:29

The dotsMethods approach was suggested by Martin Maechler and is implemented in the Rmpfr package. We need to define a new generic, a class and a method using S4.

setGeneric("rbind", signature = "...")
mychar <- setClass("myclass", slots = c(x = "character"))
b <- mychar(x = "b")
rbind.myclass <- function(...) "hello from rbind.myclass"
setMethod("rbind", "myclass",
      function(..., deparse.level = 1) {
        args <- list(...)
        # all-atomic arguments go to the base function; note base::rbind, not cbind
        if(all(vapply(args, is.atomic, NA)))
          return( base::rbind(..., deparse.level = deparse.level) )
        else
          return( rbind.myclass(..., deparse.level = deparse.level))
      })

# these work as expected
rbind(b, "d")
rbind(b, b)
rbind(b, matrix())

# this fails in R 3.4.3
rbind(b, data.frame())

Error in rbind2(..1, r) :
    no method for coercing this S4 class to a vector

I haven't been able to resolve the error. See R: Shouldn't generic methods work internally within a package without it being attached? for a related problem.

As this approach overrides rbind(), we get the warning The following objects are masked from 'package:base': rbind.

Stef van Buuren
  • I think this will only work if (1) myclass is an S4 object; (2) it does not inherit from data.frame; (3) you setClassUnion(myclass, data.frame) to something, say myclass_data.frame; (4) you define rbind.myclass_data.frame. I could be wrong here, but if not, I still stand by my answer below. – Patrick Perry Dec 28 '17 at 14:18

I don't think you're going to be able to come up with something completely satisfying. The best you can do is export rbind.myclass so that users can call it directly without doing mypackage:::rbind.myclass. You can call it something else if you want (dplyr calls its version bind_rows), but if you choose to do so, I'd use a name that evokes rbind, like rbind_myclass.

Even if you can get r-core to agree to change the dispatch behavior, so that rbind dispatches on its first argument, there are still going to be cases when users will want to rbind multiple objects together with a myclass object somewhere other than the first. How else can users dispatch to rbind.myclass(df, df, myclass)?
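As a rough sketch of what such an exported function could look like: the name rbind_myclass comes from the suggestion above, while the argument check and the echo return value are hypothetical placeholders for the real row-binding logic. In a package it would be exported with export(rbind_myclass) in the NAMESPACE file.

```r
# Hypothetical exported helper; the body is a placeholder, not real binding logic
rbind_myclass <- function(...) {
  args <- list(...)
  # Accept a myclass object in any position, not only the first
  is_mine <- vapply(args, inherits, logical(1), what = "myclass")
  if (!any(is_mine)) {
    stop("rbind_myclass() expects at least one argument of class 'myclass'")
  }
  "hello from rbind_myclass"
}

a <- structure("abc", class = "myclass")

# Works regardless of where the data frames sit in the argument list
first <- rbind_myclass(a, data.frame())
last <- rbind_myclass(data.frame(), data.frame(), a)
```

Because the user calls the function by name, no dispatch is involved, so the position of the myclass object no longer matters.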

The data.table solution seems dangerous; I would not be surprised if the CRAN maintainers put in a check and disallow this at some point.

Patrick Perry
  • Agree. Best is to export `rbind.myclass()` and call `rbind()`. The point is that you'll get the wrong answer if the first argument is `myclass`, and the second or later argument is `data.frame`. Renaming evades the problem, but the downside is that the user will need to deal with multiple names for conceptually the same operation. There are already too many examples of this in R. Also, it would break my existing code. The call `rbind(df, df, myclass)` dispatches to `base::rbind.data.frame()`, which is correct. `rbind()` handles multiple arguments by design, so no need to change that. – Stef van Buuren Dec 27 '17 at 06:19