
I have a data frame that is supposed to grow (by adding rows) during runtime. It is wise to pre-allocate the data frame beforehand (cf. The R Inferno). The pre-allocation routine should accept any data frame composition (i.e. any number of columns and any column classes).

Example

arbitraryDf<-function(){
    return(data.frame(C="char",L=TRUE,N=4.5,stringsAsFactors=FALSE))
}

This returns an arbitrary data frame to use as a template. I will need N <- 10 rows, so I might do:

data<-as.data.frame(lapply(arbitraryDf(),function(x){eval(parse(text=paste(class(x),"(",N,")")))}),stringsAsFactors=FALSE)

which returns the desired data frame.

>data
   C     L N
1    FALSE 0
2    FALSE 0
3    FALSE 0
4    FALSE 0
5    FALSE 0
6    FALSE 0
7    FALSE 0
8    FALSE 0
9    FALSE 0
10   FALSE 0
>sapply(data,class)
          C           L           N 
"character"   "logical"   "numeric" 

Needless to say, the use of eval() is ugly. Is there a more straightforward solution to this?

As said, the routine needs to accept any data frame composition; otherwise @mnel's answer would have been good enough.

Update

Essentially, I would like to achieve the same as

data <- data.frame(x= numeric(N), y= integer(N), z = character(N)) 

but in a generic way, for any data frame layout. The layout information should be drawn from a given data frame (here arbitraryDf()).

Janhoo
  • Are you looking for something like `do.call(rbind, replicate(N, arbitraryDf(), FALSE))`? What are you actually trying to achieve? – A5C1D2H2I1M1N2O1R2T1 Jun 27 '14 at 09:17
  • @Ananda As said, the goal is to allocate a data frame. Column number and classes will not be known before runtime. Your solution is nice, but it also replicates the content of the data.frame. – Janhoo Jun 27 '14 at 09:19
  • I would be using the `vector` function to allocate, but you don't provide enough detail to advise properly, your use-case is not well explained. I don't see there being any difference in `rbind`ing a complete row as opposed to `[<-` subset and replace. If you *had* to preallocate I'd be looking at e.g. `N <- 10;df <- data.frame( n = vector("numeric",N) , c = vector("character",N) )` – Simon O'Hanlon Jun 27 '14 at 09:27

2 Answers


I am not sure if this is what you are looking for. The function gendf takes two arguments: the template data frame and the number of rows. It returns an empty data frame with the given number of rows, laid out according to the template.

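arbitraryDf <- data.frame(C = "char", L = TRUE, N = 4.5, stringsAsFactors = FALSE)
arbitraryDf

gendf <- function(df, N) {
  # Create list of storage modes, one per column
  modes <- lapply(df, storage.mode)
  # Allocate one vector of length N per mode and return them as a data.frame;
  # stringsAsFactors = FALSE keeps character columns as character on R < 4.0
  return(data.frame(lapply(modes, vector, N), stringsAsFactors = FALSE))
}

x <- gendf(arbitraryDf, 10)
class(x)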
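
As a quick check of the idea above, here is a minimal sketch that restates gendf and inspects the result. The name `template` and its extra integer column `I` are made up for illustration and are not part of the original post.

```r
# Hypothetical template with an extra integer column (not in the original post)
template <- data.frame(C = "char", L = TRUE, N = 4.5, I = 1L,
                       stringsAsFactors = FALSE)

# gendf restated, with stringsAsFactors = FALSE so that character columns
# are not converted to factors on R versions before 4.0
gendf <- function(df, N) {
  modes <- lapply(df, storage.mode)
  data.frame(lapply(modes, vector, N), stringsAsFactors = FALSE)
}

x <- gendf(template, 10)
dim(x)            # 10 rows, 4 columns
sapply(x, class)  # "character" "logical" "numeric" "integer"
```

Because storage.mode looks at the underlying storage type, a factor column in the template would come back as a plain integer column with this approach.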
djhurio

To sort of generalize Simon's comment, perhaps something like this would be of use to you:

myFun <- function(sourceDF, length) {
  Classes <- sapply(sourceDF, class)
  data.frame(lapply(Classes, vector, length = length))
}

Here, we first extract the class of each column of the source data.frame and use that as a template for the new data.frame, whose number of rows is determined by the length argument.

Example:

myFun(arbitraryDf(), 5)
#   C     L N
# 1   FALSE 0
# 2   FALSE 0
# 3   FALSE 0
# 4   FALSE 0
# 5   FALSE 0
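
One caveat: vector() only accepts basic modes, so the approach above fails for columns such as factors or dates (for example, vector("factor", 5) raises an error). The following is only a sketch of one possible workaround, not part of the original answer: it indexes each template column with NA, which keeps the column's class and attributes. The names `template` and `allocLike` are hypothetical, and the allocated cells are NA rather than ""/FALSE/0.

```r
# Hypothetical template with columns that vector() cannot allocate by class
template <- data.frame(
  C   = "char",
  D   = as.Date("2014-06-27"),
  Fct = factor("a", levels = c("a", "b")),
  stringsAsFactors = FALSE
)

# Assumed helper name: index every column with n NA indices, which yields
# n missing values while preserving the class (and factor levels) of the column
allocLike <- function(sourceDF, n) {
  cols <- lapply(sourceDF, function(col) col[rep(NA_integer_, n)])
  data.frame(cols, stringsAsFactors = FALSE)
}

out <- allocLike(template, 5)
sapply(out, class)  # classes are taken over from the template
levels(out$Fct)     # factor levels survive, so existing levels can be assigned
```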
A5C1D2H2I1M1N2O1R2T1
  • +1, *but* is there really a performance enhancement from preallocating and then (I assume) subset and replace rows, rather than just rbinding them? I doubt it (but I haven't tested that). – Simon O'Hanlon Jun 27 '14 at 11:05
  • @Ananda. Very nice. Just to be sure, the return value should read: `data.frame(...,stringsAsFactors=FALSE)`, since for factors you wouldn't be able to insert any values without adding new levels – Janhoo Jun 27 '14 at 11:44
  • Yes, @Simon, definitely & drastically. Refer to [R-Inferno p12 for timings](http://www.burns-stat.com/documents/books/the-r-inferno/) – Janhoo Jun 27 '14 at 11:49
  • @Jan, I don't think that Simon is questioning the value of preallocation, but whether this is the best solution to your problem (which I still don't fully understand). – A5C1D2H2I1M1N2O1R2T1 Jun 27 '14 at 13:12
  • @Ananda, maybe my fault. It goes without saying that the above is just a toy example -- maybe not a good one. I did `rbind`ing before and have now improved it by allocating and subsetting. It's a lot faster, especially on big datasets. Thanks. – Janhoo Jun 27 '14 at 14:04
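
The performance question raised in these comments can be checked with a small timing sketch. The function names `growByRbind` and `fillPreallocated` and the row count are assumptions made for illustration; actual timings depend on the machine and the R version, so none are quoted here.

```r
n <- 1000  # assumed row count; increase it to make any gap more visible

# Variant 1: grow the data frame one row at a time with rbind
growByRbind <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(C = "char", L = TRUE, N = as.numeric(i),
                                 stringsAsFactors = FALSE))
  }
  out
}

# Variant 2: preallocate n rows, then fill them by row index
fillPreallocated <- function(n) {
  out <- data.frame(C = character(n), L = logical(n), N = numeric(n),
                    stringsAsFactors = FALSE)
  for (i in seq_len(n)) {
    out[i, ] <- list("char", TRUE, as.numeric(i))
  }
  out
}

system.time(a <- growByRbind(n))
system.time(b <- fillPreallocated(n))

# Both variants should hold the same values
identical(a$N, b$N)
```

If many rows are written, filling preallocated column vectors and building the data frame once at the end is another option worth timing.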