Merge multiple data.tables and rename columns to reflect origin

Question

As a simplified example of my problem, say I have four data.tables dt1, ..., dt4, all with the same structure:

head(dt1)
          date          x           y
 1: 2000-10-01  0.4527087 -0.11590788
 2: 2001-10-01  0.7200252 -0.55722270
 3: 2002-10-01 -1.3804472 -1.47030087
 4: 2003-10-01 -0.1380225  2.34157766
 5: 2004-10-01 -0.9288675 -1.32993998
 6: 2005-10-01 -0.9592633  0.76316150

That is, they all have three columns called date, x and y. My desired output is a merged data.table (on date) with five columns: date, and then the x column from each individual table renamed to reflect its original data.table:

head(desired_output)
          date      x_dt1       x_dt2      x_dt3      x_dt4
 1: 2000-10-01  0.4527087 -0.11590788  1.1581946 -1.5159040
 2: 2001-10-01  0.7200252 -0.55722270 -1.6247254 -0.3325556
 3: 2002-10-01 -1.3804472 -1.47030087 -0.9766309 -0.2368857
 4: 2003-10-01 -0.1380225  2.34157766  1.1831091 -0.4399184
 5: 2004-10-01 -0.9288675 -1.32993998  0.8716144 -0.4086229
 6: 2005-10-01 -0.9592633  0.76316150 -0.8860816 -0.4299365

I assume this can be done using the suffixes argument of merge.data.table somehow. I have tried to modify mergeDTs from this answer without success as yet. A solution that successfully modifies mergeDTs (or just using a function that could be applied to a list of of several data.tables) would be excellent.

I am aware of this very slick dplyr/purrr answer but would prefer a data.table solution.

Example data

library(data.table)
dt1 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt2 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt3 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt4 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

Solution

Below I have put B. Christian Kamgang's answer into functional form (to make it easily adaptable to my real problem) and removed dependency on the new pipe (since my organisation hasn't yet upgraded):

merge_select <- function(on, vars, ..., suffix = "_") {
  dts <- list(...)
  names(dts) <- sapply(as.list(substitute(list(...)))[-1L], deparse)
  
  nv <- length(vars)
  ndt <- length(dts)
  
  old_cols <- split(rep(vars, ndt),
                    ceiling(seq_along(rep(vars, ndt))/nv))
  
  new_cols <- split(paste0(vars, suffix, rep(names(dts), each = nv)),
                    ceiling(seq_along(paste0(vars, 
                                             suffix, 
                                             rep(names(dts), each = nv)))/nv))
  
  sep_cols <- lapply(dts, function(x) subset(x, select = c(on, vars)))
  
  Reduce(f = function(x,y) merge(x, y, by = on), 
         Map(f = setnames, sep_cols, old_cols, new_cols))
}

Which in my situation translates to:

merge_select("date", "x", dt1, dt2, dt3, dt4)
          date      x_dt1       x_dt2       x_dt3       x_dt4
 1: 2000-10-01 -0.6365707  0.11804268 -0.01084163 -0.88127011
 2: 2001-10-01 -0.2533127 -3.16924568  0.45746415  0.69742537
 3: 2002-10-01  2.3069266 -0.82670409 -0.54236745 -1.49613384
 4: 2003-10-01  0.7075547 -0.91809007 -0.67888707 -0.26106146
 5: 2004-10-01 -0.7165651 -0.45711888 -0.83903416  1.45113260
 6: 2005-10-01  0.5703561  0.24587897  0.13862020  0.33928202
 7: 2006-10-01 -0.6258097 -0.77652389 -0.49252474 -0.80460241
 8: 2007-10-01 -0.4600565  0.55612959  0.86749410 -1.30850411
 9: 2008-10-01 -0.8841649 -0.48113848 -1.55858406  0.83076846
10: 2009-10-01 -0.6262272 -0.73618265  0.13350581  0.06640803
11: 2010-10-01  0.1406454  0.08994779  1.28450204 -1.18329081

This solution also works for multiple variables, eg.

merge_select("date", c("x","y"), dt1, dt2, dt3, dt4)

Joris C. · Answer 1 · 2022-10-04T07:22:48.187

Here is a possible approach accumulating the merged result in a simple for loop:

library(data.table)

dt <- dt1[, .(date, x)]

for(i in 2:4) {
  dt <- merge(dt, get(paste0("dt", i))[, .(date, x)], by = "date", suffixes = c("", paste0("_dt", i)))
}

setnames(dt, old = "x", new = "x_dt1")

head(dt)
#>          date      x_dt1      x_dt2        x_dt3      x_dt4
#> 1: 2000-10-01 -1.5035218  2.0463775 -0.120544283 -0.5662290
#> 2: 2001-10-01  0.5977386 -0.1968421 -0.840102174  1.2412272
#> 3: 2002-10-01 -0.9100557 -0.1687148 -1.738526471  1.3685767
#> 4: 2003-10-01  0.7027232  0.9009135 -0.247273205  1.3135718
#> 5: 2004-10-01  0.5269265  0.6176381 -0.007662592 -0.2928206
#> 6: 2005-10-01 -0.8350406 -0.7343245 -0.643701996  2.3068948

Or alternatively, accumulating the merged result with Reduce():

Reduce(
  f = function(dt, dti) merge(dt, get(dti)[, .(date, x)], by = "date", suffixes = c("", paste0("_", dti))),
  x = paste0("dt", 1:4),
  init = dt1[, .(date, x)]
)[, x := NULL][]

Note: to get rid of the get() calls we can collect all data.tables in a list before merging or write a small function wrapper, e.g.

merge_dts <- function(...) {
  dts <- list(...)
  dt <- dts[[1]][, .(date, x)]
  for(i in seq_along(dts)[-1]) {
    dt <- merge(dt, dts[[i]][, .(date, x)], by = "date", suffixes = c("", paste0("_dt", i)))
  }
  setnames(dt, old = "x", new = "x_dt1")
  return(dt)
}

merge_dts(dt1, dt2, dt3, dt4)

Several great solutions, much appreciated. Ended up doing a mix of your last solution and B. Christian Kamgang’s solution (see my edited post). — diomedesdata, Oct 05 '22 at 05:28

score 1 · Accepted Answer · answered Oct 04 '22 at 09:09

Another solution:

nms = paste0("dt", 1:4)

lapply(mget(nms), `[`, j=c("date", "x")) |> 
  Map(f=setnames, "x", paste0("x_", nms)) |> 
  Reduce(f=\(x,y) merge(x, y, by="date"))

          date      x_dt1      x_dt2      x_dt3       x_dt4
        <Date>      <num>      <num>      <num>       <num>
 1: 2000-10-01  1.5567877 -0.3579598  1.9765388 -0.76325891
 2: 2001-10-01 -0.2656003  0.6599606 -0.8085454 -0.43846379
 3: 2002-10-01 -0.3620652 -1.1952980 -0.7108491 -0.57261896
 4: 2003-10-01  2.2358663  0.3903301 -1.0370673  1.27767750
 5: 2004-10-01  0.7298361 -0.4111746 -0.5046095  0.35132983
 6: 2005-10-01 -0.2749529 -0.3080462 -1.5936609  0.49925692
 7: 2006-10-01 -0.4624282 -0.6598809 -2.0893062 -0.10963342
 8: 2007-10-01 -1.3491373  1.1325984 -0.4708666  0.15541533
 9: 2008-10-01 -1.9459169  0.3331132 -0.5091114 -0.56520321
10: 2009-10-01 -0.8994460 -0.5218689 -0.4177664  0.69639938
11: 2010-10-01  0.8663232 -1.6789679 -0.7827229 -0.05562114

Excellent answer, I used a modified version of this (see my edited post). — diomedesdata, Oct 05 '22 at 05:26

Merge multiple data.tables and rename columns to reflect origin

Example data

Solution

2 Answers2