I'm trying to do a full outer join of multiple dataframes stored as elements of a list, using data.table. I have done this successfully with the merge_recurse() function from the reshape package, but it is very slow on larger datasets, and I'd like to speed up the merge with data.table. I'm not sure of the best way to handle a list of multiple dataframes with data.table, and I'm also not sure whether I've written the Reduce() call over the unique keys correctly to produce a full outer join.

Here's a small example:

#Libraries
library("reshape")
library("data.table")

#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))

#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))

#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here

Reduce(function(x, y) x[y[J(unique_keys)]], filelist)

Here's my expected output:

> listMerged
   x y  a  b  c
1  1 1  1  0  0
2  1 2  2  5 10
3  1 3  3 10 20
4  2 1  4 15 30
5  2 2  5 20 40
6  2 3  6 25 50
7  3 1  7 30 60
8  3 2  8 35 70
9  3 3  9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80

1 Answer

This worked for me:

library("reshape")
library("data.table")
##
filelist <- list(
  data.frame(
    x=c(1,1,1,2,2,2,3,3,3), 
    y=c(1,2,3,1,2,3,1,2,3), 
    a=1:9),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4), 
    y=c(1,2,3,1,2,3,1,2,1), 
    b=seq(from=0, by=5, length.out=9)),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4), 
    y=c(1,2,3,1,2,3,1,2,2), 
    c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList,function(x){
  setkeyv(x,cols=c("x","y"))
})
##
> Reduce(function(x,y){
  merge(x,y,all=T,allow.cartesian=T)
},dtList)
    x y  a  b  c
 1: 1 1  1  0  0
 2: 1 2  2  5 10
 3: 1 3  3 10 20
 4: 2 1  4 15 30
 5: 2 2  5 20 40
 6: 2 3  6 25 50
 7: 3 1  7 30 60
 8: 3 2  8 35 70
 9: 3 3  9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
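
As a side note, if you need to do this for more than one list of data frames, the same steps can be rolled into a small helper. This is only a sketch, and the function name outerJoinAll and its keys argument are made up here, not part of the code above:

## Hypothetical wrapper around the steps above (the names are made up):
## as.data.table() copies each element, so the input list is not modified.
outerJoinAll <- function(dfList, keys = c("x", "y")) {
  dts <- lapply(dfList, function(d) setkeyv(as.data.table(d), keys))
  Reduce(function(a, b) merge(a, b, all = TRUE, allow.cartesian = TRUE), dts)
}
## outerJoinAll(filelist) should reproduce the output above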

Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in

> dt
       filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>

which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above (note also that setkey() takes unquoted column names; setkeyv(dt, c("x","y")) is the form that accepts a character vector). Also, did this work for you?

Reduce(function(x, y) x[y[J(unique_keys)]], filelist)

I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist):

Error in eval(expr, envir, enclos) : could not find function "J"

which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by @Arun in this answer.
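
For what it's worth, here is one way your Reduce()/unique_keys idea could be written against the keyed dtList above without calling J() at all. This is only a sketch; the intermediate names uk and res are made up:

## Turn the unique (x, y) pairs into a keyed data.table, then join each
## table onto the accumulated result (a full outer join over all keys).
uk <- as.data.table(unique_keys)
setkey(uk, x, y)
res <- Reduce(function(acc, d) d[acc], dtList, init = uk)
setcolorder(res, c("x", "y", "a", "b", "c")) # match the expected column order

which should give the same 11 rows as the merge() approach above.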

    Great, this solved my question perfectly. My attempt at the Reduce() function didn't work for me either. I just had to make an informed guess based on what I could piece together from different sources. Your code works much better, thanks! – itpetersen Oct 31 '14 at 22:16
  • You're welcome; FWIW you weren't too far off with your `Reduce` function, I just had to make a couple small changes. – nrussell Oct 31 '14 at 22:20