9

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row with 100 numeric and 100 two-level factor columns
onerowdfr <- do.call(data.frame, c(list(),
                                   rnorm(100),
                                   lapply(sample(letters[1:2], 100, replace=TRUE),
                                          function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr) <- c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list of 200 dataframes, each with either 1 or 7 rows
someParts <- lapply(rbinom(200, 1, 14/200)*6+1,
                    function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.

Nick Sabbe
  • In my work, I combined a list of dataframes using a technique from Dominik here http://stackoverflow.com/questions/7224938/can-i-rbind-be-parallelized-in-r/8071176#8071176, which becomes relatively faster than do.call the larger the list gets, and I found even better performance when I read the original list data in as characters instead of factors. Using rbind spent a lot of time in match; I'm speculating it's checking for factor levels to add. – ARobertson Nov 29 '12 at 21:05

6 Answers

15

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636 

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813  

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric, so I force to integer first.

someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}

The timing on my system is:

   user  system elapsed 
  0.090   0.000   0.091 

Aaron left Stack Overflow
  • @Aaron : The data is a simulation; the OP's question starts with the dataframes. – Joris Meys May 12 '11 at 16:10
  • @Joris: it's close; you could extract each type into its own list of matrices, `rbind` each type-list, then create a data.frame. – Joshua Ulrich May 12 '11 at 16:12
  • @Joris: True, this doesn't answer the poster's specific question (how do I speed up `rbind.data.frame`?). But maybe with the knowledge that rbinding matrices is faster he can rewrite his code to avoid using data frames, or convert to data frames later. I'd love to see ways of actually speeding up `rbind.data.frame`. – Aaron left Stack Overflow May 12 '11 at 16:46
  • @Aaron: I think I will go with your EDIT for now (although I fear when my actual data.frame has even more columns). As I am using the fact that some columns are factors elsewhere, using matrices does not seem like an option. – Nick Sabbe May 13 '11 at 07:41
  • @Joshua: I tried `system.time(somePasMat<-lapply(someParts, data.matrix))` and it's even slower, unfortunately. – Nick Sabbe May 13 '11 at 07:42
  • If you change `[[` to `.subset2` (which you shouldn't, because it's an internal function) it runs about 2x faster; see the sketch after these comments. – Marek May 13 '11 at 10:26
  • @Nick: Glad you found it helpful. I wrote up something to convert to and from matrices, as I suggested at first; see my second edit. – Aaron left Stack Overflow May 13 '11 at 15:23
  • You're welcome. If speed is really an issue, you may want to think carefully about only converting to factors when necessary; there can be a lot of overhead involved. I also discovered recently that converting to a factor from an integer is faster than converting from a numeric; see my answer here for example. http://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-neighbors-in-r/5222350#5222350 In your case you can probably get a little faster by forcing to an integer before calling factor; I've edited my answer accordingly. – Aaron left Stack Overflow May 13 '11 at 16:21
  • I should note that one should be careful when using `as.integer` as it returns the truncated value, which may not be what you want when the numeric was created with floating point arithmetic, for example, `as.integer(0.3*3+0.1)` returns `0`. Here it should be okay because the numeric was created directly from an integer (that is, the integer underlying the original factor). – Aaron left Stack Overflow May 13 '11 at 16:30
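
For reference, a minimal sketch of Marek's `.subset2` variant of the column-wise approach in the first EDIT above (`.subset2` is an internal function, so use it with care):

n <- seq_len(ncol(someParts[[1]]))
names(n) <- names(someParts[[1]])
# .subset2 avoids the S3 dispatch of `[[`; Marek reports roughly a 2x speedup
result <- as.data.frame(lapply(n, function(i)
  unlist(lapply(someParts, .subset2, i))))
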
5

If you really want to manipulate your data.frames faster, I would suggest using the data.table package and its rbindlist() function. I did not perform extensive tests, but for my dataset (3000 dataframes, each 1000 rows x 40 columns) rbindlist() takes only 20 seconds.
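
A minimal sketch of this approach on the simulated someParts list from the question (assuming the data.table package is installed):

library(data.table)
# rbindlist() binds a list of data.frames in C and returns a data.table,
# which is typically much faster than do.call(rbind, ...)
result <- rbindlist(someParts)
# convert back to a plain data.frame if the rest of the code expects one
result <- as.data.frame(result)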

Daniele
5

Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
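
A minimal sketch of the swap, using the question's someParts list (assuming plyr is installed):

library(plyr)
# rbind.fill() binds a list of data frames, filling missing columns with NA;
# here all pieces share the same columns, so it is a drop-in replacement for rbind
system.time(result <- rbind.fill(someParts))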

Richie Cotton
3

This is ~25% faster, but there has to be a better way...

system.time({
  # total number of rows in the combined result
  N <- do.call(sum, lapply(someParts, nrow))
  # preallocate a data frame with the right column types by recycling the first piece
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x,N)))
  # copy each piece into its block of rows
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k+1
    k <- k + nrow(someParts[[i]])
    SP[j:k,] <- someParts[[i]]
  }
})
Joshua Ulrich
  • Building off this, I tried filling the data frame column by column by grabbing the proper column from each element with an `lapply`; it seems to be faster still. See edit to my answer. – Aaron left Stack Overflow May 12 '11 at 17:28
1

Make sure you're binding a data frame to a data frame. I ran into a huge performance degradation when binding a list to a data frame.
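
A minimal sketch of that check, assuming a hypothetical list parts whose elements may or may not already be data frames:

# coerce any non-data.frame elements before binding; the answer above reports
# a large slowdown when a plain list is bound to a data frame
parts <- lapply(parts, function(x) if (is.data.frame(x)) x else as.data.frame(x))
result <- do.call(rbind, parts)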

0

From the ecospace package, rbind_listdf works on chunks of 100 dataframes at a time. It seems to be more time and memory efficient than do.call(rbind, ...) when you are merging a list of several hundred dataframes. For merging 5000 dataframes of ~5GB total size, I saw peak memory use that was ~25% lower.
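
A minimal sketch (assuming the ecospace package is installed and that rbind_listdf takes the list of data frames as its first argument):

library(ecospace)
# rbind_listdf() processes the list in chunks of 100 data frames at a time,
# which keeps peak memory use lower than a single do.call(rbind, ...)
result <- rbind_listdf(someParts)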

Scott Kaiser