
Let's say a and b are two data frames. The goal is to write a function f(a, b) that produces a merged data frame in the same way merge(a, b, all=TRUE) would, that is, filling variables missing from a or b with NAs. (The problem is that merge() appears to be very slow.)

This can be done as follows (pseudo-code):

for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)

where:
x.srcvar is x$var if x$var exists, or else
            rep(NA, nrow(x)) if y$var is not a factor, or else
            as.factor(rep(NA, nrow(x)))

and then wrap everything in a data frame.

Here's a "naive" implementation:

merge.datasets1 <- function(a, b) {
  # NA padding for each side; the factor versions below keep factor
  # columns as factors after stacking
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  # walk the union of the column names, padding whichever side lacks v
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
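
As a quick sanity check, here is the intended behaviour on two tiny hypothetical frames (the factor is made explicit so the example doesn't depend on stringsAsFactors defaults; note that y stays a factor thanks to the fill.factor branch):

a <- data.frame(x = 1:2, y = factor(c("p", "q")))
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))
merge.datasets1(a, b)
>>   x    y     z
>> 1 1    p    NA
>> 2 2    q    NA
>> 3 3 <NA>  TRUE
>> 4 4 <NA> FALSE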

Here's a different implementation that uses "vectorized" functions:

merge.datasets2 <- function(a, b) {
  # build a lookup table: for each column name, the name of the column
  # (or of the fill vector) to take from a and from b
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  # attach the NA fill vectors to each data frame so they can be
  # selected by name like ordinary columns
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  # stack the chosen a- and b-columns pairwise
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}

Now we can test:

sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])

system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>>   user  system elapsed 
>>  0.192   0.000   0.190 
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>>   user  system elapsed 
>>  2.292   0.000   2.293 

So, the naive version is an order of magnitude faster than the other. How can this be? I always thought that for loops are slow, and that one should use lapply and friends instead, steering clear of loops in R. I would welcome any ideas on how to improve my function in terms of speed.
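
To help diagnose where the time actually goes, here is a profiling sketch using base R's sampling profiler (the exact breakdown will vary by machine, but it shows whether the time is in the per-column bookkeeping or in copying the data itself):

Rprof(tmp <- tempfile())
invisible(Reduce(merge.datasets2, sample.datasets))
Rprof(NULL)
summaryRprof(tmp)$by.self  # self time per function
unlink(tmp)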

Ernest A
  • Use `data.table`. – Joshua Ulrich Oct 29 '12 at 13:51
  • For loops are not inherently slow, if and only if you use them sensibly: in particular, pre-allocate memory, don't grow the object inside the loop, etc. – Andrie Oct 29 '12 at 14:08
  • @JoshuaUlrich `data.table` doesn't work because its `merge` method doesn't include non-common variables in the resulting data.table. – Ernest A Oct 29 '12 at 14:41
  • Have you looked at this: http://stackoverflow.com/questions/4322219/whats-the-fastest-way-to-merge-join-data-frames-in-r – screechOwl Oct 29 '12 at 14:52
  • I'd be willing to bet that there's a way to do what you want with `data.table`. – joran Oct 29 '12 at 14:59
  • What would be wrong with `merge(DT, DF, by = intersect(names(DT), names(DF)), all = TRUE)` – mnel Oct 29 '12 at 21:49
  • @mnel It's wrong because I don't want intersect() but union(). Try this with union() and it fails. – Ernest A Oct 30 '12 at 22:27
  • What are you actually merging by? Only by factor columns? – mnel Oct 30 '12 at 22:45
  • This explains the plyr method. I didn't benchmark this specific code but it's typically one of the faster options. In plyr it's a `full join`: http://stackoverflow.com/a/9652931/168689 – Rob Oct 31 '12 at 06:42
  • There's also the sqldf option. It uses the SQLite engine, so it's fast, and if you know SQL the syntax is easy. In sqldf it's an `outer join`: http://stackoverflow.com/a/4483202/168689 – Rob Oct 31 '12 at 06:50
  • @mnel I'm stacking data.frames and filling the missing columns in either data.frame with NAs. – Ernest A Oct 31 '12 at 16:40
  • @ErnestA `rbind.fill.matrix(...)` and `cbind.fill.matrix(...)` in the `plyr` package will fill NAs in the columns. – Rob Oct 31 '12 at 17:19

1 Answer


In fact, you are not trying to replicate merge(a, b, all = TRUE) at all, since you are not merging on any of the columns. Instead you are simply stacking the data, filling with NA where a column does not exist.

 # note that this is not what you want
dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = T))
 [1] 314   5

The reason merge(a, b, all = TRUE) is slow is that it defaults to merging by the intersection of the names. If you convert to data.tables, the merge.data.table method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 you want your result to be).
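
(As an aside, postdating this answer: later versions of data.table grew a direct route for this stack-and-fill operation, rbindlist() with fill = TRUE. A minimal sketch, assuming a recent data.table:)

library(data.table)
# rbindlist() takes a list of data.frames; fill = TRUE pads columns
# missing from any individual frame with NA and returns a data.table
stacked <- rbindlist(sample.datasets, fill = TRUE)
dim(stacked)  # 7500 rows by 5 columns, as desired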

An easy solution is to use rbind.fill from the plyr package.

library(plyr)
system.time({.x <- Reduce(rbind.fill, sample.datasets)})
## user  system elapsed 
## 0.16    0.00    0.15 
# which is almost identical to
system.time(.old <- Reduce(merge.datasets1, sample.datasets))
##   user  system elapsed 
##   0.14    0.00    0.14 

EDIT 2-11-2012

On further consideration, it is worth noting that you can pass the whole list of data.frames to rbind.fill in a single call, so

 system.time(super_fast <- rbind.fill(sample.datasets))
 ##  user  system elapsed 
 ##  0.02    0.00    0.02 

identical(super_fast, .old)
[1] TRUE

The majority of the time is spent in the overhead of the pairwise Reduce approach (49 calls, each allocating an intermediate data frame), which a single rbind.fill call does not require.

mnel