This is a higher level data.table
question about how the columns are ordered after a join or merge is performed. Right now, it is easy to work around the issues, but I am trying to understand why it is happening and get a better feeling for data.table
.
Here is a minimal example.
library(data.table)
set.seed(123)
dt_example1 <- data.table(C1 = c(1,2,3,5,6,7),
C2 = c("A","A","B","C","A","A"))
dt_example2 <- data.table(C1 = 1:10,
C3 = sample(x = LETTERS,size = 10,replace = T))
setkey(dt_example1,
C1)
setkey(dt_example2,
C1)
dt_example2[J(dt_example1)]
C1 C3 C2
1: 1 H A
2: 2 U A
3: 3 K B
4: 5 Y C
5: 6 B A
6: 7 N A
merge(x = dt_example1,
y = dt_example2,
all.x = TRUE)
C1 C2 C3
1: 1 A H
2: 2 A U
3: 3 B K
4: 5 C Y
5: 6 A B
6: 7 A N
I am drawing from the example provided here, which states that Y[X]
is equivalent to merge(x = X,y = Y,all.x = TRUE)
(i.e. a left outer join). When performing them, the same data is returned for both queries. However, the order of columns is different.
I am aware of the answer posted here that explains how merge()
performs the joining in both directions. My question is both why the column order would be different, and how one could ensure that the column order is the same, which that question does not address.
I can understand conceptually how, when doing Y[X]
, the non-key columns in Y
might come first (since it is, after all, Y
followed by [X]
). Similarly, with the merge()
, I could see how the function might re-order the output to ensure that variables appear in the order of "by variable, x variables, y variables".
Why is the column order different between these two methods? Is there any way to ensure that the column order returned by merge()
and by joins using J()
is the same?