13

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround, where, if I change the 2nd column name for DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

smci
  • 32,567
  • 20
  • 113
  • 146
srao
  • 308
  • 1
  • 3
  • 13
  • possible duplicate of http://stackoverflow.com/questions/28378637/reduce-in-r-over-similar-variable-names-causing-error – pcantalupo Sep 11 '15 at 15:31
  • 1
    "Works fine"? Good luck with two `y.y`s, etc. – Frank Sep 11 '15 at 15:32
  • It looks like it happens because the `Reduce` function can't give a new column name. In other words, you run out of combinations of `x` and `y` (there are only 4 possible combinations so the error occurs when working with a 5th column). – under_the_sea_salad Sep 11 '15 at 15:34
  • 1
    @Frank Yes, but I can use setnames() after I get a merged data table (I read the 5 files always in a particular order), so that's not an issue for me. – srao Sep 11 '15 at 15:37
  • @bourbaki4481472 that makes sense – srao Sep 11 '15 at 15:39
  • 2
    @srao - that's a bad idea - do the `setnames` before/during the join, not after – eddi Sep 11 '15 at 15:39
  • Can the `x` vector contain repeated values? Is it identical in all `DT`s? – Frank Sep 11 '15 at 15:39
  • 1
    @Frank yes, `x` is exactly the same in all DTs. There are no repeated values. – srao Sep 11 '15 at 15:42
  • @eddi setnames() during join would be my choice too, but how can I do that? – srao Sep 11 '15 at 15:44
  • 2
    @srao if `x` is exactly the same for all of them, you shouldn't be merging – eddi Sep 11 '15 at 16:17
  • @eddi you're right, I can do `cbind` when `x` is exactly the same. I started using merge because I thought it would take care of any missing rows in some of the DTs, so yes, to clarify, `x` might not be the same in all DTs, but definitely has no repeated values. Sorry for the confusion! – srao Sep 11 '15 at 16:34
  • @Frank see my comment above regarding `x`, sorry for mis-stating the case – srao Sep 11 '15 at 16:35
  • 1
    Okay, thanks for clarifying. Although it's best to make your question representative of the problem you're actually facing, I think you got lucky this time and all the answers happen to still work in that case, except maybe the first part of Jaap's. – Frank Sep 11 '15 at 16:40
  • 1
    To be clear, you want multiple y columns with automatic names? Like `y.1 y.2 y.3 y.4 y.5`? And given you know that only column `y` is duplicated. So we don't need to auto-detect duplicate columns, and auto-rename/number them. – smci May 04 '18 at 01:18
  • [`data.table::merge()` has an argument `suffixes = c(".x", ".y")`](https://www.rdocumentation.org/packages/data.table/versions/1.10.4-2/topics/merge), so you can avoid the rename step, just pass the corresponding number suffix already. – smci May 04 '18 at 01:21

7 Answers7

9

If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired solution:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • this is not the same as a `merge` with `all=TRUE` – eddi Sep 11 '15 at 15:43
  • 2
    Nothing surprising about it working on the toy example. The two will diverge once you add values of `x` that are not all identical between all 5. `merge` with `all = TRUE` does an outer join, whereas `[` does a one-sided one. – eddi Sep 11 '15 at 15:57
  • 1
    Your first one is aka `Reduce(function(X,Y) X[Y], DTlist)` – Frank Sep 11 '15 at 16:10
  • @eddi True, but the OP said in the comments that `x` is exactly the same for all DT's. I've updated my answer to show that with non-indetical `x` values this does not work. – Jaap Sep 11 '15 at 16:14
  • 2
    ok, though if `x` is exactly the same, doing a merge is silly – eddi Sep 11 '15 at 16:16
  • @Frank Thanx, added it to my answer. – Jaap Sep 11 '15 at 16:24
7

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19
eddi
  • 49,088
  • 6
  • 104
  • 155
  • What's up with the parentheses around and following the function definition, like `(function()...)()`? – Frank Sep 11 '15 at 16:27
  • 2
    @Frank it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extract – eddi Sep 11 '15 at 16:35
5

stack and reshape I don't think this maps exactly to the merge function but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no sense for if this would extend to having more columns than y.

merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

Frank
  • 66,179
  • 8
  • 96
  • 180
3

Using reshaping gives you a lot more flexibility in how you want to name your columns.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Or, this would work

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)
bramtayl
  • 4,004
  • 2
  • 11
  • 18
  • After the `bind_rows` there is no `source` column, so I see the `Error: cannot coerce type 'closure' to vector of type 'character'` (since `source` is a function). Not sure the fix for this... presumably you're misusing `bind_rows`...? – Frank Sep 11 '15 at 16:03
  • The .id feature is new in dplyr 0.4.3. Is that the version you are using? – bramtayl Sep 11 '15 at 16:29
  • Nope, 0.4.2. That must explain it. Thanks. – Frank Sep 11 '15 at 16:33
2

Another way of doing this:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.

Henrik Seidel
  • 301
  • 3
  • 3
1

Alternatively you could setNames for the columns before and do merge like this

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19
Veerendra Gadekar
  • 4,452
  • 19
  • 24
  • Fyi, it is not necessary to give names to `dts`; you can already refer to them by `1:5`. Also, you probably want a `for` loop with `setnames`, not `setNames` (mentioned by eddi in a comment on the q and used by the op). – Frank Sep 11 '15 at 16:15
  • 1
    @Frank yes thanks! I just thought to include that step thinking if OP may want to put data.table names instead of just numbers, in that way the ultimate data.table would be more informative right – Veerendra Gadekar Sep 11 '15 at 16:18
  • @Frank I didn't understand why for loop? – Veerendra Gadekar Sep 11 '15 at 16:23
  • 1
    The `data.table` function `setnames` operates by reference, modifying the object itself, so there is no need to assign the function's value/result. – Frank Sep 11 '15 at 16:24
  • 1
    @Frank Oh, that's when I am using `data.table`'s `setnames` right.. OK cool! – Veerendra Gadekar Sep 11 '15 at 16:28
0

This is an alternative solution - you can define join columns each time (when your x columns do not have the same values) . You need to define vectors with column names. Then you may chain joining by reference like this:

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]
Shawn Hemelstrand
  • 2,676
  • 4
  • 17
  • 30