37

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the data.frame and filling it in, and others.

The question that came up was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat-skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
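
For concreteness, here is a minimal sketch of a few of the approaches named above, on a toy list where each element becomes one column. The object names (`lst`, `df1` through `df4`) are made up for illustration, and nothing here says which approach is fastest.

```r
# Toy input: N = 3 elements, each a vector of length X = 4.
lst <- list(a = 1:4, b = 5:8, c = 9:12)

# Direct coercion of the list.
df1 <- as.data.frame(lst)

# Pass the list elements as arguments to data.frame().
df2 <- do.call(data.frame, lst)

# Bind the elements into a matrix first, then coerce the matrix.
df3 <- as.data.frame(do.call(cbind, lst))

# Pre-allocate a data.frame of the right shape, then fill it column by column.
df4 <- data.frame(matrix(NA_integer_, nrow = 4, ncol = 3))
names(df4) <- names(lst)
for (nm in names(lst)) df4[[nm]] <- lst[[nm]]

# All four hold the same data.
all.equal(df1, df2)
all.equal(df1, df3)
all.equal(df1, df4)
```

Note that the cbind route goes through a matrix, so it only makes sense when all elements share one type.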

Matt Dowle
geoffjentry

2 Answers

29

Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:

set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)

system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})

identical(a, b)  # TRUE
identical(b, d)  # TRUE

Update - this is ~2x faster than creating d:

system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})

identical(d, e)  # TRUE

Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
    class="data.frame", names=make.names(names(g), unique=TRUE))
})

identical(f,g)  # TRUE
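
If you want this as a reusable function that does not rely on knowing n in advance, something like the sketch below might do. The name `list_to_df` is hypothetical, and I've swapped the literal c(NA_integer_, n) for base R's `.set_row_names()`, which builds the same kind of compact row names. The only validation kept is the equal-length check, so it is still far less safe than data.frame().

```r
# Hypothetical helper, not part of base R or any package.
list_to_df <- function(x) {
  stopifnot(is.list(x), length(x) > 0L)

  # Derive the row count from the data instead of assuming it.
  lens <- lengths(x)
  stopifnot(all(lens == lens[1L]))
  n <- lens[[1L]]

  # Fall back to V1, V2, ... when the list has no names at all.
  nms <- names(x)
  if (is.null(nms)) nms <- paste0("V", seq_along(x))

  # One attributes<- call rather than several attr<- calls.
  attributes(x) <- list(
    row.names = .set_row_names(n),
    class = "data.frame",
    names = make.names(nms, unique = TRUE)
  )
  x
}

# Usage on a small list with a duplicated name.
df <- list_to_df(list(x = 1:5, y = letters[1:5], x = 6:10))
str(df)
```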
Joshua Ulrich
  • (2) Leave "probably" out of the answer and it's correct. It's also correct if you make a function using those calls and replacing the cheat of knowing n with a length command. Your new function is roughly equivalent to data.frame() after removing all of the extensive checks. So, if you know for sure you're handing the call the right input then just do what Josh recommended for speed. If you're unsure then data.frame is safer, and do.call(data.frame, x) is next fastest (oddly enough). – John May 09 '11 at 22:18
  • (3) See `plyr::quickdf` for exactly this function. – hadley May 09 '11 at 23:25
  • @hadley: `plyr::quickdf` doesn't provide exactly this function; namely it doesn't make unique column names. `plyr:::make_names` only replaces missing names and doesn't have a `unique=` arg like `base::make.names`. – Joshua Ulrich May 10 '11 at 00:58
  • (1) @John: By "probably" I meant "to the best of my knowledge". I try not to speak too strongly if I'm not absolutely certain. – Joshua Ulrich May 10 '11 at 01:08
  • (1) Ok, not exactly, but pretty close (unique column names aren't a prerequisite for a valid data frame). I'm not sure that memory hacks based on undocumented behaviour of `attributes<-` are a good idea. – hadley May 10 '11 at 03:09
  • (1) @hadley: what memory hacks? I was merely pointing out that 1 call to `attributes<-` makes fewer copies than 3 calls to `attr<-`. – Joshua Ulrich May 10 '11 at 03:18
  • (2) Nice demo of `tracemem` in action, and a good illustration of the difference between lists and data frames. – Richie Cotton May 10 '11 at 10:07
  • @Joshua : +1 for skipping c ;) – Joris Meys May 10 '11 at 11:17
  • Maybe I misinterpreted your answer - `structure` is the canonical way of returning an object with modified attributes. – hadley May 10 '11 at 20:29
  • (3) @hadley: canonical according to whom? I can't find any discussion of that in the manuals and `attr<-` and `structure` seem to be used about equally often in the core R sources... and `structure` uses `attributes<-`. – Joshua Ulrich May 10 '11 at 21:00
10

This seems to call for a data.table suggestion, given that efficiency with large datasets is required. Notably, setattr sets attributes by reference and does not copy.

library(data.table)
set.seed(21)
n <- 1e6
h <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
h <- c(h,h,h,h,h,h)
tracemem(h)

system.time({
  h <- as.data.table(h)
  setattr(h, 'names', make.names(names(h), unique=TRUE))
})

as.data.table, however, does make a copy.


Edit - no copying version

Following @MatthewDowle's suggestion, setattr(h,'class','data.frame') converts to data.frame by reference (no copies):

set.seed(21)
n <- 1e6
i <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
i <- c(i,i,i,i,i,i)
tracemem(i)

system.time({
  setattr(i, 'class', 'data.frame')
  setattr(i, "row.names", c(NA_integer_,n))
  setattr(i, "names", make.names(names(i), unique=TRUE))
})
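
As a possible shortcut, more recent versions of data.table also export setDF, which (as far as I understand its documentation) converts a list to a data.frame by reference and sets the class and row.names for you. The sketch below assumes data.table is installed and recent enough to have setDF; the object `j` is made up for illustration. setDF does not make duplicate names unique, so that is still done with setattr first.

```r
# Assumes the data.table package is available; skipped otherwise.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  set.seed(21)
  j <- list(x = rnorm(5), y = rnorm(5), x = rnorm(5))

  # Make the duplicated column name unique, by reference.
  setattr(j, "names", make.names(names(j), unique = TRUE))

  # Convert the list to a data.frame in place.
  setDF(j)

  print(class(j))
  print(names(j))
}
```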
Matt Dowle
mnel
  • (1) setattr(h,"class","data.frame") should be instant, no copy at all. – Matt Dowle Sep 12 '12 at 08:07
  • @MatthewDowle -- As is `setattr(h, "class", "data.table")` ;) (Very cool, BTW). – Josh O'Brien Sep 12 '12 at 08:16
  • @JoshO'Brien Indeed :) Only realised in the last few days that `?setattr` says that `x` must be `data.table` (thanks to comment on datatable-help). `setattr` is actually intended to work on anything. Will fix docu. It returns its input too, so you can compound `[i,j,by]` afterwards if needed (say if you wrap it up into an alias: `setDT(DF)[i,j,by]`). – Matt Dowle Sep 12 '12 at 08:34
  • @MatthewDowle -- Yeah, I tried your code and was pleased to see that it accomplished the conversion to `data.frame` without making any copies. Nice hacking! – Josh O'Brien Sep 12 '12 at 08:58
  • @JoshO'Brien `setattr` is actually just a one line wrapper for R's C level `setAttrib` API function. Package `bit` has the same function, btw. It has `vecseq` too (I've just seen) which looks very handy. Might be worth reviewing `bit` to see what other gems it has (note to self). – Matt Dowle Sep 12 '12 at 09:16