Let's say `a` and `b` are two data frames. The goal is to write a function `f(a, b)` that produces a merged data frame, in the same way as `merge(a, b, all=TRUE)` would, that is, filling variables missing from `a` or `b` with NAs. (The problem is that `merge()` appears to be very slow.)
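For concreteness, here is a toy example of the intended behavior (the frames and column names are made up purely for illustration):

a <- data.frame(x = 1:2, y = c("u", "v"))
b <- data.frame(x = 3, z = TRUE)
merge(a, b, all = TRUE)
>>   x    y    z
>> 1 1    u   NA
>> 2 2    v   NA
>> 3 3 <NA> TRUE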
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where, for `x` one of the two frames and `y` the other:
    `x.srcvar` is `x$var` if `x$var` exists, or else
    `rep(NA, nrow(x))` if `y$var` is not a factor, or else
    `as.factor(rep(NA, nrow(x)))`
and then wrap everything in a data frame.
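The reason for `unlist()` rather than `c()` is that `unlist()` combines factors while merging their level sets, whereas `c()` (at least in older versions of R) degrades them to their integer codes. A quick illustration, with made-up values:

f1 <- factor(c("a", "b"))
f2 <- as.factor(rep(NA, 2))
unlist(list(f1, f2), recursive=FALSE, use.names=FALSE)
>> [1] a    b    <NA> <NA>
>> Levels: a b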
Here's a "naive" implementation:
merge.datasets1 <- function(a, b) {
  # precompute the NA padding for each side, in plain and factor form
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      # v exists only in b: pad a with NAs of the matching type
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      # v exists in a, and possibly in b too
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
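With the toy frames from above, this yields (again, made-up data, just to show the shape of the result):

a <- data.frame(x = 1:2, y = c("u", "v"))
b <- data.frame(x = 3, z = TRUE)
merge.datasets1(a, b)
>>   x    y    z
>> 1 1    u   NA
>> 2 2    v   NA
>> 3 3 <NA> TRUE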
Here's a different implementation that uses "vectorized" functions:
merge.datasets2 <- function(a, b) {
  # build a lookup table: for each variable, the name of the source column
  # to take from a and from b ('fill'/'fill.factor' are the NA padding
  # columns appended below)
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  # append the NA padding columns to (local copies of) a and b
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x, y) unlist(list(a[[x]], b[[y]]),
                                      recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}
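To see what the lookup table computes, here is what `srcvar` works out to for a small hypothetical pair of frames (with `y` declared as a factor so that the `fill.factor` branch is exercised):

a <- data.frame(x = 1:2, y = factor(c("u", "v")))
b <- data.frame(x = 3, z = TRUE)
# inside merge.datasets2(a, b), the lookup table comes out as:
#   var  srcvar$a  srcvar$b
#   x    "x"       "x"
#   y    "y"       "fill.factor"   (y is a factor and missing from b)
#   z    "fill"    "z"             (z is missing from a; not a factor in b)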
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[, sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>>   user  system elapsed
>>  0.192   0.000   0.190
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>>   user  system elapsed
>>  2.292   0.000   2.293
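(As a sanity check, not part of the timings, the two implementations should agree on these inputs:)

identical(Reduce(merge.datasets1, sample.datasets),
          Reduce(merge.datasets2, sample.datasets))
>> [1] TRUE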
So the naive version is roughly an order of magnitude faster than the other. How can this be? I always thought that `for` loops are slow and that one should rather use `lapply()` and friends and steer clear of loops in R. I would welcome any ideas on how to improve my function in terms of speed.