50

Most of the questions on SO about merging data.frames in lists don't quite relate to what I'm trying to get across here, but feel free to prove me wrong.

I have a list of data.frames. I would like to "rbind" them row by row: all first rows form one data.frame, all second rows a second data.frame, and so on. The result would be a list whose length equals the number of rows in each original data.frame. For now, all the data.frames have identical dimensions.

Here's some data to play around with.

sample.list <- list(data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)))

Here's what I've come up with using the good ol' for loop.

#solution 1
my.list <- vector("list", nrow(sample.list[[1]]))
for (i in 1:nrow(sample.list[[1]])) {
    for (j in 1:length(sample.list)) {
        my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
    }
}

#solution 2 (so far my favorite)
sample.list2 <- do.call("rbind", sample.list)
my.list2 <- vector("list", nrow(sample.list[[1]]))

for (i in 1:nrow(sample.list[[1]])) {
    my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
}

Can this be improved using vectorization without much brainhurt? Correct answer will contain a snippet of code, of course. "Yes" as an answer doesn't count.

EDIT

#solution 3 (a variant of solution 2 above)
ind <- rep(1:nrow(sample.list[[1]]), times = length(sample.list))
my.list3 <- split(x = sample.list2, f = ind)
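
As a sanity check that solutions 2 and 3 really produce the same pieces, here is a minimal sketch. It assumes a small seeded sample (7 data.frames of 10 rows, built with replicate() rather than typed out), and `strip` is a hypothetical helper that drops list names and row names, which are the only places the two results differ.

```r
set.seed(1)  # seeded so the check is reproducible
sample.list <- replicate(7, data.frame(x = sample(1:100, 10),
                                       y = sample(1:100, 10),
                                       capt = sample(0:1, 10, replace = TRUE)),
                         simplify = FALSE)
nr <- nrow(sample.list[[1]])
sample.list2 <- do.call("rbind", sample.list)

# solution 2: pick every nr-th row of the stacked data.frame, starting at row i
my.list2 <- vector("list", nr)
for (i in seq_len(nr)) {
    my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nr), ]
}

# solution 3: split the stacked data.frame on a repeating row index
ind <- rep(seq_len(nr), times = length(sample.list))
my.list3 <- split(x = sample.list2, f = ind)

# hypothetical helper: remove list names and row names before comparing
strip <- function(l) lapply(unname(l), function(d) { rownames(d) <- NULL; d })
stopifnot(isTRUE(all.equal(strip(my.list2), strip(my.list3))))
```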

BENCHMARKING

I've made my list larger with more rows per data.frame. I've benchmarked the results which are as follows:

#solution 1
system.time(for (i in 1:nrow(sample.list[[1]])) {
    for (j in 1:length(sample.list)) {
        my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
    }
})
   user  system elapsed 
 80.989   0.004  81.210 

# solution 2
system.time(for (i in 1:nrow(sample.list[[1]])) {
    my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
})
   user  system elapsed 
  0.957   0.160   1.126 

# solution 3
system.time(split(x = sample.list2, f = ind))
   user  system elapsed 
  1.104   0.204   1.332 

# solution Gabor
system.time(lapply(1:nr, bind.ith.rows))
   user  system elapsed 
  0.484   0.000   0.485 

# solution ncray
system.time(alply(do.call("cbind",sample.list), 1,
                .fun=matrix, ncol=ncol(sample.list[[1]]), byrow=TRUE,
                dimnames=list(1:length(sample.list),names(sample.list[[1]]))))
   user  system elapsed 
 11.296   0.016  11.365
smci
Roman Luštrik
  • Why o why did I forget about split? Very nice solution! – Joris Meys Feb 01 '11 at 15:46
  • Very nice demonstration. This kind of situation was one of the few where I still tend to use for loops, but pretty clear why that's a bad idea :) – J. Win. Feb 01 '11 at 16:13
  • @jonw, I guess it depends on what you're after. If you have medium or smallish data sets, loops are fine. – Roman Luštrik Feb 01 '11 at 17:20
  • How about this merged.list = do.call('rbind', sample.list) – Omar Wagih Oct 03 '12 at 18:24
  • Unfortunately this just merges the list into one big data.frame. This is the intermediate step I use in my solution #2. – Roman Luštrik Oct 03 '12 at 20:06
  • Ah, my bad. Misread the question – Omar Wagih Oct 04 '12 at 02:05
  • @DWin Hi. I don't follow your new bounty, what's it for? – Matt Dowle Dec 31 '12 at 08:23
  • I had read that it was possible to award bounties even if the question was already answered. So I decided to go looking for good answers. This one is posing a bit of a problem for me however, because mnel's DT solution appears better than the accepted solution that I originally planned to award. – IRTFM Dec 31 '12 at 08:41
  • @DWin Interesting, nice idea. I didn't realise somebody other than the asker could award bounty, either. – Matt Dowle Dec 31 '12 at 10:08

4 Answers

48

Try this:

bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
nr <- nrow(sample.list[[1]])
lapply(1:nr, bind.ith.rows)
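
For readers unfamiliar with passing "[" to lapply, here is a small sketch of what that call does. It assumes a tiny seeded sample (3 data.frames of 4 rows); the names `ith.rows`, `explicit` and `second.rows` are only for illustration.

```r
set.seed(42)
sample.list <- replicate(3, data.frame(x = sample(1:100, 4),
                                       y = sample(1:100, 4),
                                       capt = sample(0:1, 4, replace = TRUE)),
                         simplify = FALSE)
i <- 2

# "[" is an ordinary function: "["(df, i, TRUE) means df[i, TRUE],
# i.e. row i with all columns kept
ith.rows <- lapply(sample.list, "[", i, TRUE)

# the same extraction written with an explicit anonymous function
explicit <- lapply(sample.list, function(df) df[i, ])

# stack the extracted rows into one data.frame holding every second row
second.rows <- do.call(rbind, ith.rows)
```

Wrapping this in a function of i and calling it through lapply(1:nr, ...) gives one data.frame per row position.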
G. Grothendieck
40

A couple of solutions that make this quicker using data.table.

EDIT - with larger dataset showing data.table awesomeness even more.

# here are some sample data 
sample.list <- replicate(10000, data.frame(x = sample(1:100, 10), 
  y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)), simplify = FALSE)

Gabor's fast solution:

# Solution Gabor
bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
nr <- nrow(sample.list[[1]])
system.time(rowbound <- lapply(1:nr, bind.ith.rows))

##    user  system elapsed 
##   25.87    0.01   25.92 

The data.table function rbindlist makes this even quicker, even when working with data.frames:

library(data.table)
fastbind.ith.rows <- function(i) rbindlist(lapply(sample.list, "[", i, TRUE))
system.time(fastbound <- lapply(1:nr, fastbind.ith.rows))

##    user  system elapsed 
##   13.89    0.00   13.89 

A data.table solution

Here is a solution that uses data.tables. It is the split solution on steroids.

# data.table solution
system.time({
    # change each element of sample.list to a data.table (and data.frame); this
    # is done instantaneously by reference
    invisible(lapply(sample.list, setattr, name = "class", 
               value = c("data.table",  "data.frame")))
    # combine into a big data set
    bigdata <- rbindlist(sample.list)
    # add a row index column (by reference); rep() spells out the recycling,
    # which recent data.table versions no longer do implicitly for `:=`
    index <- as.character(seq_len(nr))
    bigdata[, `:=`(rowid, rep(index, times = length(sample.list)))]
    # set the key for binary searches
    setkey(bigdata, rowid)
    # split on this -
    dt_list <- lapply(index, function(i, j, x) x[i = J(i)], x = bigdata)
    # if you want to drop the `row id` column
    invisible(lapply(dt_list, function(x) set(x, j = "rowid", value = NULL)))
    # if you really don't want them to be data.tables run this line
    # invisible(lapply(dt_list, setattr,name = 'class', value =
    # c('data.frame')))
})
################################
##    user  system elapsed    ##
##    0.08    0.00    0.08    ##
################################
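
As a shorter variant of the same idea, here is a sketch that assumes a data.table version whose split() method accepts a `by` argument (not available when this answer was written). The column names `source` and `rowid` are introduced only for illustration, and the sample is a small seeded one.

```r
library(data.table)

set.seed(1)
sample.list <- replicate(5, data.frame(x = sample(1:100, 10),
                                       y = sample(1:100, 10),
                                       capt = sample(0:1, 10, replace = TRUE)),
                         simplify = FALSE)

# stack everything; idcol records which list element each row came from
bigdata <- rbindlist(sample.list, idcol = "source")

# number the rows within each original data.frame (by reference)
bigdata[, rowid := seq_len(.N), by = source]

# one data.table per row position; keep.by = FALSE drops the rowid column
dt_list <- split(bigdata, by = "rowid", keep.by = FALSE)
```

Each element of dt_list still carries the `source` column, which can be dropped if it is not needed.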

How awesome is data.table!

Caveat user with rbindlist

rbindlist is fast because it does not perform the checks that do.call(rbind, ...) does. For example, it assumes that any factor columns have the same levels as in the first element of the list.
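
Given that caveat, one option is to confirm up front that every element has the same structure before binding. The sketch below uses base R only; `same_structure` is a hypothetical helper, not part of data.table, and it compares column names, classes and factor levels against the first element.

```r
# hypothetical helper: TRUE if every data.frame matches the first one in
# column names, column classes and factor levels
same_structure <- function(dfs) {
  col_info <- function(d) {
    lapply(d, function(col) list(class = class(col), levels = levels(col)))
  }
  ref <- dfs[[1]]
  all(vapply(dfs, function(d) {
    identical(names(d), names(ref)) && identical(col_info(d), col_info(ref))
  }, logical(1)))
}

a <- data.frame(g = factor(c("a", "b")), v = 1:2)
b <- data.frame(g = factor(c("b", "c")), v = 3:4)  # same columns, different levels

same_structure(list(a, a))  # TRUE
same_structure(list(a, b))  # FALSE: factor levels differ
```

If the check fails, falling back to do.call(rbind, ...) trades speed for the extra checking.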

Matt Dowle
mnel
5

Here's my attempt with plyr, but I like G. Grothendieck's approach:

library(plyr)
alply(do.call("cbind",sample.list), 1, .fun=matrix,
        ncol=ncol(sample.list[[1]]), byrow=TRUE,
        dimnames=list(1:length(sample.list),
        names(sample.list[[1]])
      ))
smci
ncray
1

Adding a tidyverse solution:

library(tidyverse)
bind_rows(sample.list)
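
Note that bind_rows() on its own only stacks the list into one big data.frame, which is the intermediate step of solution 2 in the question. A possible way to finish the job is sketched below, assuming dplyr is installed and using a small seeded sample; the `rowid` column and the `by_row` name are introduced only for illustration.

```r
library(dplyr)

set.seed(1)
sample.list <- replicate(7, data.frame(x = sample(1:100, 10),
                                       y = sample(1:100, 10),
                                       capt = sample(0:1, 10, replace = TRUE)),
                         simplify = FALSE)

# tag each row with its position inside its own data.frame, then stack
bound <- bind_rows(lapply(sample.list, function(d) mutate(d, rowid = row_number())))

# split on the row position, dropping the helper column from the pieces
by_row <- split(select(bound, -rowid), bound$rowid)
```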
cephalopod