How to merge (full join) a list of more than three data.tables correctly?

Question

When using merge.data.table() I get ugly error messages about duplicated columns names—strangely enough only if there are more than three data.tables. On the contrary standard merge() with data.frame versions works fine.

I use this code, to achieve a full join avoid time-split columns with merge.data.table().

So, in base R

Reduce(function(...) merge(..., all=TRUE), L)

runs fine, whereas in data.table

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt, names))), L.dt)

trows the error:

  Error in merge.data.table(..., all = TRUE, by = Reduce(intersect, lapply(L.dt,  : 
  x has some duplicated column name(s): p.l01.x,p.l02.x,p.l03.x,p.l05.x,p.l06.x,p.l07.x,
p.l08.x,p.l10.x,p.l11.x,p.l12.x,
p.l01.y,p.l02.y,p.l03.y,p.l05.y,p.l06.y,p.l07.y,
p.l08.y,p.l10.y,p.l11.y,p.l12.y. Please remove 
or rename the duplicate(s) and try again.
In addition: Warning message:
In merge.data.table(..., all = TRUE, by = Reduce(intersect, lapply(L.dt,  :

 Error in merge.data.table(..., all = TRUE, by = Reduce(intersect, lapply(L.dt,  : 
  x has some duplicated column name(s): p.l01.x,p.l02.x,p.l03.x,p.l05.x,p.l06.x,p.l07.x,
p.l08.x,p.l10.x,p.l11.x,p.l12.x,
p.l01.y,p.l02.y,p.l03.y,p.l05.y,p.l06.y,p.l07.y,p.l08.y,p.l10.y,p.l11.y,p.l12.y. Please remove
or rename the duplicate(s) and try again.

Note that the issue appears to be cumulative somehow...

While

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt[1:2], names))), L.dt[1:2])

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt[1:3], names))), L.dt[1:3])

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt[1:4], names))), L.dt[1:4])

still runs fine,

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt[1:5], names))), L.dt[1:5])

throws a warning, and from here

Reduce(function(...)
  merge(..., all=TRUE, by=Reduce(intersect, lapply(L.dt[1:6], names))), L.dt[1:6])

finally the error starts.

Thus, merge.data.table() appears to be working fine only until the number of data.tables exceeds 3?? Any ideas how I could solve this issue?

Note: With easy example data no error appears. So it could be caused by my data. But I have no idea why—standard merge() works fine! My real data has about dim=15.000*500, some variable names match throughout the list, others are individual. This MCVE seems to reproduce the issue quite well.

Data and code to reproduce

# names vector mimicking my data   
nm <- list(c("p.d17m", "p.d17", "p.d29", "p.d31", "p.n03", "p.n04", 
"p.n05", "p.d36", "p.d40", "p.d41", "p.d45", "p.d46", "p.d50", 
"p.d51", "p.d55", "p.d56", "p.d60", "p.d61", "p.d65", "p.d66", 
"p.d70", "p.d71", "p.n08"), c("p.d17m", "p.d17", "p.c44", "p.l01", 
"p.l02", "p.l03", "p.l05", "p.l06", "p.l07", "p.l08", "p.l10", 
"p.l11", "p.l12", "p.l13", "p.l15", "p.l16", "p.l17", "p.l18", 
"p.l20", "p.l21", "p.l22", "p.l23", "p.l25"), c("p.d17m", "p.d17", 
"p.c44", "p.l01", "p.l02", "p.l03", "p.l05", "p.l06", "p.l07", 
"p.l08", "p.l10", "p.l11", "p.l12", "p.l13", "p.l15", "p.l16", 
"p.l17", "p.l18", "p.l20", "p.l21", "p.l22", "p.l23", "p.l25"
), c("p.d17m", "p.d17", "p.c44", "p.l01", "p.l01r", "p.l02", 
"p.l03", "p.l05", "p.l06", "p.l07", "p.l08", "p.l10", "p.l11", 
"p.l12", "p.l13", "p.l15", "p.l16", "p.l17", "p.l18", "p.l20", 
"p.l21", "p.l22", "p.l23"), c("p.d17m", "p.d17", "p.c44", "p.l01", 
"p.l01r", "p.l02", "p.l03", "p.l05", "p.l06", "p.l07", "p.l08", 
"p.l10", "p.l11", "p.l12", "p.l13", "p.l15", "p.l16", "p.l17", 
"p.l18", "p.l20", "p.l21", "p.l22", "p.l23"), c("pdate.", "p.d17m", 
"p.d17", "p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", 
"p.d166", "p.d167", "p.c44", "p.l01", "p.l01r", "p.l02", "p.l03", 
"p.l05", "p.l06", "p.l07", "p.l08", "p.l10", "p.l11", "p.l12"
), c("pdate.", "p.d17m", "p.d17", "p.d160", "p.d161", "p.d162", 
"p.d163", "p.d164", "p.d165", "p.d166", "p.d167", "p.c44", "p.l01", 
"p.l01r", "p.l02", "p.l03", "p.l05", "p.l06", "p.l07", "p.l08", 
"p.l10", "p.l11", "p.l12"), c("pdate.", "p.d200", "p.d201", "p.d17m", 
"p.d17", "p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", 
"p.d166", "p.d167", "p.c44", "p.c47", "p.c48", "p.c49", "p.c50", 
"p.l01", "p.l01r", "p.l02", "p.l03", "p.l05"), c("pdate.", "p.d17m", 
"p.d17", "p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", 
"p.d166", "p.d167", "p.c44", "p.c47", "p.c48", "p.c49", "p.c50", 
"p.l01", "p.l01r", "p.l02", "p.l03", "p.l05", "p.l06", "p.l07"
), c("pdate.", "p.d17m", "p.d17", "p.d160", "p.d161", "p.d162", 
"p.d163", "p.d164", "p.d165", "p.d166", "p.d167", "p.c44", "p.c47", 
"p.c48", "p.c49", "p.c50", "p.l01", "p.l01r", "p.l02", "p.l03", 
"p.l05", "p.l06", "p.l07"), c("plingu.", "pdate.", "p.d17m", 
"p.d17", "p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", 
"p.d166", "p.d167", "p.c44", "p.c47", "p.c48", "p.c49", "p.c50", 
"p.c60", "p.c61", "p.c62", "p.c63", "p.c64", "p.c65"), c("hab_ch.", 
"plingu.", "pdate.", "p.d17m", "p.d17", "p.d160", "p.d161", "p.d162", 
"p.d163", "p.d164", "p.d165", "p.d166", "p.d167", "p.c44", "p.c47", 
"p.c48", "p.c49", "p.c50", "p.c60", "p.c61", "p.c62", "p.c63", 
"p.c64"), c("hab_ch.", "plingu.", "pdate.", "p.d17m", "p.d17", 
"p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", "p.d166", 
"p.d167", "p.c44", "p.c47", "p.c48", "p.c49", "p.c50", "p.c60", 
"p.c61", "p.c62", "p.c63", "p.c64"), c("hab_ch.", "plingu.", 
"pdate.", "p.d17m", "p.d17", "p.d160", "p.d161", "p.d162", "p.d163", 
"p.d164", "p.d165", "p.d166", "p.d167", "p.c44", "p.c100", "p.c101", 
"p.c102", "p.c103", "p.c47", "p.c48", "p.c49", "p.c50", "p.l01"
), c("hab_ch.", "plingu.", "pdate.", "p.d17m", "p.d17", "p.d110a", 
"p.d110b", "p.d160", "p.d161", "p.d162", "p.d163", "p.d164", 
"p.d165", "p.d166", "p.d167", "p.c44", "p.c47", "p.c48", "p.c49", 
"p.c50", "p.l01", "p.l01r", "p.l02"), c("hab_ch.", "plingu.", 
"pdate.", "p.d17m", "p.d17", "p.d110a", "p.d110b", "p.d160", 
"p.d161", "p.d162", "p.d163", "p.d164", "p.d165", "p.d166", "p.d167", 
"p.c44", "p.c47", "p.c48", "p.c49", "p.c50", "p.l01", "p.l01r", 
"p.l02"), c("hab_ch.", "plingu.", "pdate.", "p.d17m", "p.d17", 
"p.d110a", "p.d110b", "p.d160", "p.d161", "p.d162", "p.d163", 
"p.d164", "p.d165", "p.d166", "p.d167", "p.c44", "p.c100", "p.c101", 
"p.c102", "p.c103", "p.c47", "p.c48", "p.c49"), c("hab_ch.", 
"plingu.", "pdate.", "p.d17m", "p.d17", "p.d110a", "p.d110b", 
"p.d160", "p.d161", "p.d162", "p.d163", "p.d164", "p.d165", "p.d166", 
"p.d167", "p.c44", "p.c47", "p.c48", "p.c49", "p.c50", "p.l01", 
"p.l01r", "p.l02"))

n <- 10  # row numbers

fun <- function(x) {
  # fun() produces a n x 22 data.frame
  set.seed(x)
  d <- data.frame(
    cbind(id=1:n, yr=x, 
          matrix(rnorm(n*(20)), n)))
}

tmp <- setNames(lapply(1:6, fun), paste0("df", 1:6))  # produce six data frames

list2env(tmp, globalenv())  # get data frames into globalenv

# insert extra columns into some df's
df3 <- cbind(id=df3[, 1], foo=sample(1e3:2e3, nrow(df3), replace=TRUE), df3[, -1])
df4 <- cbind(id=df4[, 1], foo=sample(1e3:2e3, nrow(df4), replace=TRUE), df4[, -1])
df5 <- cbind(id=df5[, 1], foo=sample(1e3:2e3, nrow(df5), replace=TRUE), df5[, -1])
df6 <- cbind(id=df6[, 1], foo=sample(1e3:2e3, nrow(df5), replace=TRUE), df6[, -1])

# giving var names:
names(df1)[3:22] <- nm[[1]][3:22]
names(df2)[3:22] <- nm[[2]][3:22]
names(df3)[4:23] <- nm[[3]][4:23]
names(df4)[4:23] <- nm[[4]][4:23]
names(df5)[4:23] <- nm[[5]][4:23]
names(df5)[4:23] <- nm[[6]][4:23]

# list of data.frames:
L <- list(df1, df2, df3, df4, df5, df6)

# list of data.tables:
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("id", "yr")))

_{Also related: 1, 2}

Respond to the error about the duplicates: `cols <- data.table(cols=unlist(lapply(L, function(X){colnames(X)})))`; `cols[, .N, by=cols][N>1,]`. e.g. you have column `p.l01` 4 times. What do you want the output to be for this when there are 4 versions of these in the output? data.table handles some duplicates by putting `.x` and `.y` after, but it has limits. It's more robust to change the names of your data, going in to the merge, than alter them post-merge — Jonny Phelps, Nov 23 '18 at 15:18
`p.101` is always combined with `year` so there should expectedly be no problem. W/o [this answer](https://stackoverflow.com/a/53439312/6574038) I'd expect the output to be `p.101.x`, `p.101.y`. `p.101.something`, but at least this is solved in the linked answer. *"but it has limits"* - that's exactly the point. In base R it's working - in data.table not (yet). — jay.sf, Nov 23 '18 at 15:36
When I try `Reduce(function(...) merge(..., all=TRUE), L)` from your post, it removes `p.l04`. I wouldn't call that working, I'd call that a silent bug. the data.table error is at least helpful here — Jonny Phelps, Nov 23 '18 at 15:39
`p.l04` is not in the names, try `any(sapply(L, function(x) "p.l04" %in% names(x)))` — jay.sf, Nov 23 '18 at 15:51
sorry my mistake. Shouldn't there be more columns for `p.l01` in this output? `names(Reduce(function(...) merge(..., all=TRUE), L))` — Jonny Phelps, Nov 23 '18 at 16:04
The different "columns" for `p.l01` are identified by `yr`, that's what's called "long format" that I want to achieve with this munging in fact (by contrast to multiple columns in "wide format"). — jay.sf, Nov 23 '18 at 16:11
r u looking for `Reduce(function(x, y) merge(x, y, all=TRUE, by=intersect(names(x), names(y))), L.dt)` ? — chinsoon12, Nov 26 '18 at 01:01
@chinsoon12 That worked perfectly, thanks. (Runs nearly 30 times faster than with data.frames!) Could you explain why the code for data.tables needs to be adapted in this way? — jay.sf, Nov 26 '18 at 06:38

How to merge (full join) a list of more than three data.tables correctly?

Data and code to reproduce

0 Answers0