Merge multiple data tables with duplicate column names

Question

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround, where, if I change the 2nd column name for DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

possible duplicate of http://stackoverflow.com/questions/28378637/reduce-in-r-over-similar-variable-names-causing-error — pcantalupo, Sep 11 '15 at 15:31
It looks like it happens because the `Reduce` function can't give a new column name. In other words, you run out of combinations of `x` and `y` (there are only 4 possible combinations so the error occurs when working with a 5th column). — under_the_sea_salad, Sep 11 '15 at 15:34
@Frank Yes, but I can use setnames() after I get a merged data table (I read the 5 files always in a particular order), so that's not an issue for me. — srao, Sep 11 '15 at 15:37
@srao - that's a bad idea - do the `setnames` before/during the join, not after — eddi, Sep 11 '15 at 15:39
Can the `x` vector contain repeated values? Is it identical in all `DT`s? — Frank, Sep 11 '15 at 15:39
@Frank yes, `x` is exactly the same in all DTs. There are no repeated values. — srao, Sep 11 '15 at 15:42
@eddi setnames() during join would be my choice too, but how can I do that? — srao, Sep 11 '15 at 15:44
@srao if `x` is exactly the same for all of them, you shouldn't be merging — eddi, Sep 11 '15 at 16:17
@eddi you're right, I can do `cbind` when `x` is exactly the same. I started using merge because I thought it would take care of any missing rows in some of the DTs, so yes, to clarify, `x` might not be the same in all DTs, but definitely has no repeated values. Sorry for the confusion! — srao, Sep 11 '15 at 16:34
@Frank see my comment above regarding `x`, sorry for mis-stating the case — srao, Sep 11 '15 at 16:35
Okay, thanks for clarifying. Although it's best to make your question representative of the problem you're actually facing, I think you got lucky this time and all the answers happen to still work in that case, except maybe the first part of Jaap's. — Frank, Sep 11 '15 at 16:40
To be clear, you want multiple y columns with automatic names? Like `y.1 y.2 y.3 y.4 y.5`? And given you know that only column `y` is duplicated. So we don't need to auto-detect duplicate columns, and auto-rename/number them. — smci, May 04 '18 at 01:18
[`data.table::merge()` has an argument `suffixes = c(".x", ".y")`](https://www.rdocumentation.org/packages/data.table/versions/1.10.4-2/topics/merge), so you can avoid the rename step, just pass the corresponding number suffix already. — smci, May 04 '18 at 01:21

Jaap · Answer 1 · 2015-09-11T17:26:01.410

If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired solution:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")

Nothing surprising about it working on the toy example. The two will diverge once you add values of `x` that are not all identical between all 5. `merge` with `all = TRUE` does an outer join, whereas `[` does a one-sided one. — eddi, Sep 11 '15 at 15:57
@eddi True, but the OP said in the comments that `x` is exactly the same for all DT's. I've updated my answer to show that with non-indetical `x` values this does not work. — Jaap, Sep 11 '15 at 16:14
ok, though if `x` is exactly the same, doing a merge is silly — eddi, Sep 11 '15 at 16:16

score 7 · Accepted Answer · answered Sep 11 '15 at 16:09

7

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19

answered Sep 11 '15 at 16:09

eddi

49,088
6
104
155

What's up with the parentheses around and following the function definition, like `(function()...)()`? – Frank Sep 11 '15 at 16:27
2

@Frank it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extract – eddi Sep 11 '15 at 16:35

Frank · Answer 3 · 2015-09-11T16:20:12.683

stack and reshape I don't think this maps exactly to the merge function but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no sense for if this would extend to having more columns than y.

merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

score 3 · Answer 4 · answered Sep 11 '15 at 15:46

3

Using reshaping gives you a lot more flexibility in how you want to name your columns.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Or, this would work

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)

answered Sep 11 '15 at 15:46

bramtayl

4,004
2
11
18

After the `bind_rows` there is no `source` column, so I see the `Error: cannot coerce type 'closure' to vector of type 'character'` (since `source` is a function). Not sure the fix for this... presumably you're misusing `bind_rows`...? – Frank Sep 11 '15 at 16:03
The .id feature is new in dplyr 0.4.3. Is that the version you are using? – bramtayl Sep 11 '15 at 16:29
Nope, 0.4.2. That must explain it. Thanks. – Frank Sep 11 '15 at 16:33

Henrik Seidel · Answer 5 · 2016-03-03T12:41:50.640

Another way of doing this:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.

Veerendra Gadekar · Answer 6 · 2015-09-11T16:38:15.467

1

Alternatively you could setNames for the columns before and do merge like this

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19

edited Sep 11 '15 at 16:38

answered Sep 11 '15 at 16:14

Veerendra Gadekar

4,452
19
24

Fyi, it is not necessary to give names to `dts`; you can already refer to them by `1:5`. Also, you probably want a `for` loop with `setnames`, not `setNames` (mentioned by eddi in a comment on the q and used by the op). – Frank Sep 11 '15 at 16:15
1

@Frank yes thanks! I just thought to include that step thinking if OP may want to put data.table names instead of just numbers, in that way the ultimate data.table would be more informative right – Veerendra Gadekar Sep 11 '15 at 16:18
@Frank I didn't understand why for loop? – Veerendra Gadekar Sep 11 '15 at 16:23
1

The `data.table` function `setnames` operates by reference, modifying the object itself, so there is no need to assign the function's value/result. – Frank Sep 11 '15 at 16:24
1

@Frank Oh, that's when I am using `data.table`'s `setnames` right.. OK cool! – Veerendra Gadekar Sep 11 '15 at 16:28

score 0 · Answer 7 · edited Jan 23 '23 at 10:03

This is an alternative solution - you can define join columns each time (when your x columns do not have the same values) . You need to define vectors with column names. Then you may chain joining by reference like this:

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]

Merge multiple data tables with duplicate column names

7 Answers7

Linked