I want to create a data.frame of different variables, including S4 classes. For a built-in class like "POSIXlt" (for dates) this works fine:

as.data.frame(list(id=c(1,2), 
                   date=c(as.POSIXlt('2013-01-01'),as.POSIXlt('2013-01-02'))))

But now I have a user-defined class, say a "person" class with a name and an age:

setClass("person", representation(name="character", age="numeric"))

But the following fails:

as.data.frame(list(id=c(1,2), pers=c(new("person", name="John", age=20),
                                     new("person", name="Tom", age=30))))

I also tried to overload the [ operator for the person class using

setMethod(
  f = "[",
  signature="person",
  definition=function(x,i,j,...,drop=TRUE){ 
    initialize(x, name=x@name[i], age = x@age[i])
  }
)

This allows for vector-like behavior:

persons = new("person", name=c("John","Tom"), age=c(20,30))
p1 = persons[1]

But still the following fails:

as.data.frame(list(id=c(1,2), pers=persons))

Perhaps I have to overload more operators to get the user-defined class into a data frame? I am sure there must be a way to do this, as POSIXlt is an S4 class and it works! A solution using the new R5 reference classes would also be fine.

I do not want to put all my data into the person class (you could ask why "id" is not a member of person, so that I would not need data frames at all). The idea is that my data.frame represents a table from a database with many columns of different types, e.g., strings and numbers, but also dates, intervals, geo-objects, etc. While I already have a solution for dates (POSIXlt), for intervals, geo-objects, etc. I probably need to specify my own S4/R5 classes.

Thanks a lot in advance.

Karsten W.
Patrick Roocks
  • POSIXlt objects are S3 classes, not S4. `p = as.POSIXlt('2013-01-01'); isS4(p)` returns FALSE. – Spacedman Jan 30 '13 at 13:07
  • You are right. Additionally, "person" is already an R class. Using S3 classes, I do not get an error, i.e. `p = structure(list(name="Tom", age=20), class = "mypers"); as.data.frame(list(id=c(1,2), pers=c(p,p)))` runs. But I get a result with 5 columns (instead of 2: one for id, one for pers). Still not a solution for the problem. Thanks anyway! – Patrick Roocks Jan 30 '13 at 14:21
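
A minimal sketch of why the S3 attempt in the comment above yields five columns, plus one possible workaround. The workaround (a list column protected by `I()`) is an assumption added here, not something proposed in the thread, and it stores one object per row rather than one vectorized object per column.

```r
# S3 object as in the comment above; "mypers" is the commenter's class name.
p <- structure(list(name = "Tom", age = 20), class = "mypers")

# No c() method exists for "mypers", so c() falls back to plain list
# concatenation: the class attribute is dropped and the fields are flattened.
flat <- c(p, p)

# Each element of the flattened list becomes its own column next to "id".
wide <- as.data.frame(list(id = c(1, 2), pers = flat))

# Assumption: wrapping a list of objects in I() keeps it as a single
# list column, with one "mypers" object per row.
df <- data.frame(id = c(1, 2), pers = I(list(p, p)))
```

With this layout each cell is retrieved with `df$pers[[i]]`; printing and other data.frame operations may still need `format` or similar methods for the class.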

2 Answers

Here's your class, with a "column" interpretation of its definition rather than a "row" one, i.e., a single object holds a vector of names and a vector of ages. This will be important for performance. A date vector is included for reference:

setClass("person", representation(name="character", age="numeric"))
pers <- new("person", name=c("John", "Tom"), age=c(20, 30))
date <- as.POSIXct(c('2013-01-01', '2013-01-02'))

Some experimenting, including looking at methods(class="POSIXct") and paying attention to error messages, led me to implement as.data.frame.person and format.person (the latter is used for display in a data.frame) as

as.data.frame.person <-
    function(x, row.names=NULL, optional=FALSE, ...)
{
    if (is.null(row.names))
        row.names <- x@name
    value <- list(x)
    attr(value, "row.names") <- row.names
    class(value) <- "data.frame"
    value
}

format.person <- function(x, ...) paste0(x@name, ", ", x@age)

This gets me my objects in a data.frame:

> lst <- list(id=1:2, date=date, pers=pers)
> as.data.frame(lst)
     id       date     pers
John  1 2013-01-01 John, 20
Tom   2 2013-01-02  Tom, 30

If I want to subset, then I need

setMethod("[", "person", function(x, i, j, ..., drop=TRUE) {
    initialize(x, name=x@name[i], age=x@age[i])
})

I'm not sure what other methods might be required as more data.frame operations are encountered; there is no "data.frame interface".
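
As a rough check of the subsetting method on its own, outside any data.frame, the following sketch may help. The class name `personx` is hypothetical, chosen here only to avoid clashing with the existing "person" class mentioned in the comments on the question, and the `length` method is the one used in the data.table example.

```r
library(methods)

# "personx" is a hypothetical name (assumption), not the class from the answer.
setClass("personx", representation(name = "character", age = "numeric"))

# Subset every slot with the same index so the slots stay aligned.
setMethod("[", "personx", function(x, i, j, ..., drop = TRUE) {
    initialize(x, name = x@name[i], age = x@age[i])
})

# Length of the object is the length of the vectors it holds.
setMethod("length", "personx", function(x) length(x@name))

pers <- new("personx", name = c("John", "Tom"), age = c(20, 30))
second <- pers[2]              # positional index
older <- pers[pers@age > 25]   # logical index
```

Because the index is passed straight through to the slot vectors, positional, negative and logical indices all behave as they do for ordinary vectors.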

Using the vectorized class in data.table seems to require a length method for construction.

> library(data.table)
> data.table(id=1:2, pers=pers)
Error in data.table(id = 1:2, pers = pers) : 
  problem recycling column 2, try a simpler type
> setMethod(length, "person", function(x) length(x@name))
[1] "length"
> data.table(id=1:2, pers=pers)
   id     pers
1:  1 John, 20
2:  2  Tom, 30

Maybe there's a data.table interface?

Martin Morgan
  • Thank you very much for the idea of overloading as.data.frame.CLASS. With the help of your ideas, I even managed to get my original code to run: I overloaded the c-method using `c.person = function(...) { args = list(...); return(new("personx", name=sapply(args, function(x) x@name), age=sapply(args, function(x) x@age)))}` Then the result of `c(p1,p2)`, where p1 and p2 are persons, can be included in a data.frame. – Patrick Roocks Jan 30 '13 at 15:55
  • I don't use S4 much I'm afraid so I'm a bit lost. @SteveLianoglou is a co-author on the data.table project and knows S4 much better than I. – Matt Dowle Jan 30 '13 at 15:57
  • @MatthewDowle it's not so much an S4 issue; what methods would an S3 class need to implement to be data.table compatible, e.g., `pers = structure(list(name=c("Tom", "Bob"), age=c(20, 30), sex=c("M", "M")), class="s3person")` might have `length.s3person = function(x) length(x$name)`, i.e., length of the contained vector rather than the list that holds the class. What methods are required for `data.table(id=1:2, pers=pers)`, etc., to work? – Martin Morgan Jan 30 '13 at 16:06
  • Oh I see. Just wrap `pers` in `list()` e.g. `data.table(id=1:2, pers=list(pers))` works and it then recycles the object and prints the class name in the cell value `""`. IIUC. – Matt Dowle Jan 30 '13 at 16:14
  • But in data.table, each column must be vector atomic (including list). You can't have 2 vectors in one column, even as a class. The underlying structure in memory would be too hard to group efficiently. Either they need to be separate columns like in a database: Name, Age, Sex. Or the object could be a single person, and you have a new instance in each cell, which is what I showed in the last comment (but isn't what you wanted I think because both Tom and Bob appear on both rows in that example). – Matt Dowle Jan 30 '13 at 16:22
  • Yes, it's expensive in S3 and especially S4, to create one-instance-per-row, rather than one-instance-per-column, and the one-instance-per-column fits both with R's vectorized notions and the structure of a data frame (which is one-instance-per-column). Probably discussion for some other forum. – Martin Morgan Jan 30 '13 at 16:35
  • This example doesn't seem to work anymore. Can you update it? – thc Apr 21 '21 at 16:24
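
The idea of overloading `c` from the comments above can also be sketched as an S4 method. This is an illustrative variant, not the commenter's code: it uses `setMethod` instead of an S3 `c.person` function, reuses the name `personx` from that comment, and assumes every argument is a `personx` object.

```r
library(methods)

setClass("personx", representation(name = "character", age = "numeric"))

# Combine several objects slot by slot into one vectorized object.
# Assumption: all arguments are "personx" instances; nothing is checked.
setMethod("c", "personx", function(x, ...) {
    args <- list(x, ...)
    new("personx",
        name = unlist(lapply(args, function(p) p@name)),
        age  = unlist(lapply(args, function(p) p@age)))
})

p1 <- new("personx", name = "John", age = 20)
p2 <- new("personx", name = "Tom", age = 30)
both <- c(p1, p2)
```

Here `both` is a single object holding two names and two ages, which matches the one-instance-per-column layout discussed in the comments.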

Judging by this thread on the mailing list:

http://tolstoy.newcastle.edu.au/R/e2/devel/06/11/1013.html

...John Chambers was thinking about this in 2006. And still we can't put S4 objects in columns of data frames. We can't put complex S3 classes in columns of data frames either.

There are some other tabular data structures that might do it - data.table perhaps:

require(data.table)
setClass("geezer", representation(name="character", age="numeric"))
tom=new("geezer",name="Tom",age=20)
dick=new("geezer",name="Dick",age=23)
harry=new("geezer",name="Harry",age=25)
gt = data.table(geezers=c(tom,dick,harry),weapons=c("Gun","Gun","Knife"))
gt
    geezers weapons
1: <geezer>     Gun
2: <geezer>     Gun
3: <geezer>   Knife

The semantics of data.table are a bit different from those of data.frame, so don't expect to plug a data.table into any code that uses a data.frame and have it work (for example, I suspect lm and glm will go wobbly). But it seems the data.table authors allow compound classes in columns...

Spacedman
  • +1 `lm` and `glm` won't go wobbly! `data.table` is fully compatible with them. They are blissfully _unaware_ of `data.table`. [This answer](http://stackoverflow.com/a/10529888/403310) explains how it works. We aren't aware of any incompatibility with any other package at all. They use traditional `[.data.frame` syntax on the `data.table` and it works. I know people often say _mostly_ compatible, but we aren't aware of _any_ incompatibility. The last paragraph of FAQ 2.17 refers to this. But it really should be prominent in `?data.table` too (will add). – Matt Dowle Jan 30 '13 at 15:25
  • As long as I don't try and fit a model with a compound S4 class from a data.table in it? That might be an odd thing to do, but someone might expect those objects to work like a factor... – Spacedman Jan 30 '13 at 15:32
  • True, maybe not that. Actually is there a typo in that paragraph: should "uses a `data.table`" be "uses a `data.frame`" ? – Matt Dowle Jan 30 '13 at 15:42
  • But yes objects as cell values is an intended feature, although perhaps not as well tested as simpler stuff. – Matt Dowle Jan 30 '13 at 15:43
  • Thank you very much for the idea of using a data.table! I am dreaming of a package that makes it possible to import a database table via JDBC into R, where all JDBC objects (including intervals, geo-objects, ...) of the given columns are represented by appropriate S4/R5 classes. There is the RJDBC package, which already offers a smart way of querying a database with R and JDBC... but unfortunately everything is converted to character and numeric... such a data.table could offer much more! – Patrick Roocks Jan 30 '13 at 16:05