3

This is a higher level data.table question about how the columns are ordered after a join or merge is performed. Right now, it is easy to work around the issues, but I am trying to understand why it is happening and get a better feeling for data.table.

Here is a minimal example.

library(data.table)

set.seed(123)
dt_example1 <- data.table(C1 = c(1,2,3,5,6,7),
                          C2 = c("A","A","B","C","A","A"))
dt_example2 <- data.table(C1 = 1:10,
                          C3 = sample(x = LETTERS,size = 10,replace = T))

setkey(dt_example1,
       C1)
setkey(dt_example2,
       C1)

dt_example2[J(dt_example1)]

   C1 C3 C2
1:  1  H  A
2:  2  U  A
3:  3  K  B
4:  5  Y  C
5:  6  B  A
6:  7  N  A

merge(x = dt_example1,
      y = dt_example2,
      all.x = TRUE)

   C1 C2 C3
1:  1  A  H
2:  2  A  U
3:  3  B  K
4:  5  C  Y
5:  6  A  B
6:  7  A  N

I am drawing from the example provided here, which states that Y[X] is equivalent to merge(x = X,y = Y,all.x = TRUE) (i.e. a left outer join). When performing them, the same data is returned for both queries. However, the order of columns is different.

I am aware of the answer posted here that explains how merge() performs the joining in both directions. My question is both why the column order would be different, and how one could ensure that the column order is the same, which that question does not address.

I can understand conceptually how, when doing Y[X], the non-key columns in Y might come first (since it is, after all, Y followed by [X]). Similarly, with the merge(), I could see how the function might re-order the output to ensure that variables appear in the order of "by variable, x variables, y variables".

Why is the column order different between these two methods? Is there any way to ensure that the column order returned by merge() and by joins using J() is the same?

Community
  • 1
  • 1
TARehman
  • 6,659
  • 3
  • 33
  • 60

1 Answers1

4

Note that reversing the order of the merge and using the all.y=TRUE instead of all.x=TRUE to ensure the same rows will be kept, results in the left join with the exact same column order:

merge(x = dt_example2, y = dt_example1, all.y = TRUE, by="C1")
   C1 C3 C2
1:  1  H  A
2:  2  U  A
3:  3  K  B
4:  5  Y  C
5:  6  B  A
6:  7  N  A

The way I look at the columns ordering is that the data.table on the left side serves as the "base" to which additional columns are added using the data.table on the right. This holds for both the join and the merge methods.

In terms of row selection, the data.table on the right data.table serves to subset the rows of the "base" data.table on the left in the join method. In the merge method, this row subsetting is determined by the "all" family of arguments.

So, assure that the methods will return the same result:

  1. put the same data.table on the left ("x" argument for merge) and
  2. use the all.y=TRUE argument in merge.
lmo
  • 37,904
  • 9
  • 56
  • 69