12

Due to time constraints, I've decided to use data tables in my code instead of data frames, as they are much faster. However, I still want the functionality of data frames. I need to merge two data tables, conserving all values (like setting all=TRUE in merge).

Some example code:

> x1 = data.frame(index = 1:10)
> y1 = data.frame(index = c(2,4,6), weight = c(.2, .5, .3))
> x1
   index
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
> y1
  index weight
1     2    0.2
2     4    0.5
3     6    0.3

> merge(x,y, all=TRUE)
      index weight
 [1,]     1     NA
 [2,]     2      1
 [3,]     3     NA
 [4,]     4      2
 [5,]     5     NA
 [6,]     6      3
 [7,]     7     NA
 [8,]     8     NA
 [9,]     9     NA
[10,]    10     NA

Now can I do a similar thing with data tables? (The NA's don't necessarily have to stay, I change them to 0's anyways).

> x2 = data.table(index = 1:10, key ="index")
> y2 = data.table(index = c(2,4,6), weight= c(.3,.5,.2))

I know you can merge, but I also know that there is a faster way.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Mike Flynn
  • 1,025
  • 2
  • 12
  • 29
  • I've pondered this in the past and never found an answer. I don't think this is addressed in the [FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) either, or I'm too dense to see where it's covered. I've always resorted to using `merge()` when I needed to do this... – Chase Jul 11 '12 at 15:21
  • Also, `merge.data.table` has been sped up in recent versions. So the penalty of using `merge()` vs `X[Y]` is (well, should be) much less now. Type `data.table:::merge.data.table` to see how it works using `X[Y]` internally. – Matt Dowle Jul 11 '12 at 15:49
  • @MatthewDowle - Ahh yes, I had read over that a few times but it didn't really click. Seeing this example and the answer below brings everything full circle. Cheers. – Chase Jul 11 '12 at 16:20

2 Answers2

8

so following on from Translating SQL joins on foreign keys to R data.table syntax

x2 = data.table(index = 1:10, key ="index")
y2 = data.table(index = c(2,4,6), weight= c(.3,.5,.2),key="index")
y2[J(x2$index)]
Community
  • 1
  • 1
shhhhimhuntingrabbits
  • 7,397
  • 2
  • 23
  • 23
  • 3
    yup, just figured that out a second ago actually. Actually y2[x2] works just fine as well, but I had disregarded it before because x2[y2] did not preserve all of them and I thought they were the same. Thanks! – Mike Flynn Jul 11 '12 at 15:41
  • And how would you do an outer join on both sides; i.e. what if: y2 = data.table(index = c(2,4,6,7,8,11), weight= c(.3,.5,.2), key = "index") ? I would like to get all 11 index values as an output as merge(..., all=TRUE) would give. – nassimhddd Jul 11 '12 at 16:34
  • I think you could try just plain old merge() with all = TRUE, because it uses the faster merge.data.table() function. However, I haven't gotten it to work properly yet, so I will get back to this. – Mike Flynn Jul 12 '12 at 13:52
1

I use a function like:

mergefast<-function(x,y,by.x,by.y,all) {
  x_dt<-data.table(x)
  y2<-y
  for (i in 1:length(by.y)) names(y2)[grep(by.y[i],names(y2))]<-by.x[i]
  y_dt<-data.table(y2)
  setkeyv(x_dt,by.x)
  setkeyv(y_dt,by.x)
  as.data.frame(merge(x_dt,y_dt,by=by.x,all=all))
}

which can be used in your example as:

mergefast(x1,y1,by.x="index",by.y="index",all=T)

It's a bit lacking in features that merge has, e.g. by, all.x, all.y, but these can be easily incorporated.

uday
  • 6,453
  • 13
  • 56
  • 94