Extending Suffixes in Merge to All Non-by Columns

Question

suffixes in merge works only on common column names. Is there anyway to extend this to the rest of the columns as well without manually updating columns before the merge?

That is -

df1 <- data.table(
a = c(1,2,3,4,5,6),
b = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

df2 <- data.table(
a = c(1,2,3,4,5,6),
d = c('q','l','o','n','q','z')
)

colnames(merge(df1,df2, by = 'a', suffixes = c("1","2")))
#[1] "a"  "b"  "d1" "d2" what it does
#[1] "a"  "b1" "d1" "d2" what I'd like it to do

The current way I'm handling this resembles @mrip's answer.

df1 <- data.table(
a = c(1,2,3,4,5,6),
b = c('a','b','f','e','r','h'),
r = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

df2 <- data.table(
a = c(1,2,3,4,5,6),
c = c('a','b','f','e','r','h'),
q = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

dfmerge <- (merge(df1,df2, by = c("a"), suffixes = c("1","2")))

setnames(
dfmerge,
setdiff(names(df1),names(df2)),
paste0(setdiff(names(df1),names(df2)),"1")
)

setnames(
dfmerge,
setdiff(names(df2),names(df1)),
paste0(setdiff(names(df2),names(df1)),"2")
)

colnames(dfmerge)
#[1] "a"  "b1" "r1" "d1" "c2" "q2" "d2"

Presumably you mean `a1` in that last example, not just `a`? — joran, Oct 02 '13 at 16:46
It's inconvenient, but with the efficiency of `setnames`, I wonder if renaming before you merge will end up being the most straightforward solution. — A5C1D2H2I1M1N2O1R2T1, Oct 02 '13 at 16:46
@joran, I'm guessing they don't since that is the "by" column. — A5C1D2H2I1M1N2O1R2T1, Oct 02 '13 at 16:47
@AnandaMahto Gotcha. I agree that there won't be a way to do this (with `merge`) that doesn't involved renaming the columns yourself at some point. — joran, Oct 02 '13 at 16:48
@MattDowle, I used `data.table` out of habit but I will accept an answer even if it isn't based on data.table. However, data.table compatible answer preferred. — TheComeOnMan, Oct 02 '13 at 17:00
The point of the `suffixes` argument (in either `data.frame` or `data.table`) is to give the user the ability to distinguish columns that have the same name (which otherwise would be hard after a merge). As such I don't understand why you'd want `merge` to do this. If you don't like the names you have, just change them yourself. — eddi, Oct 02 '13 at 17:30
@eddi, perhaps they want to indicate which `df` the column originally came from — Ricardo Saporta, Oct 02 '13 at 17:36

score 12 · Accepted Answer · edited Oct 02 '13 at 18:26

12

A simple solution:

mrg<-(merge(df1,df2, by = 'a', suffixes = c("1","2")))
setnames(mrg,paste0(names(mrg),ifelse(names(mrg) %in% setdiff(names(df1),names(df2)),"1","")))
setnames(mrg,paste0(names(mrg),ifelse(names(mrg) %in% setdiff(names(df2),names(df1)),"2","")))

> names(mrg)
[1] "a"  "b1" "d1" "d2"

Edit: thanks to comments by Ricardo Saporta for cleaning this up considerably and teaching me a few new tips!

edited Oct 02 '13 at 18:26

Ricardo Saporta

54,400
17
144
178

answered Oct 02 '13 at 17:34

mrip

14,913
4
40
58

@RicardoSaporta Yes! Thanks, I was confused about the. Cleaned it up. – mrip Oct 02 '13 at 17:41
@RicardoSaporta Yes again! Thanks, that will save me a lot of `,sep="")` in the future. Great tip~ – mrip Oct 02 '13 at 17:43
@eddi, why? the `by` values are the ones that do NOT require appending a suffix – Ricardo Saporta Oct 02 '13 at 17:44
@mrip, thanks. You and I are thinking along the same lines. I added the code I'm currently using to the question if you'd like to have a look at it. – TheComeOnMan Oct 02 '13 at 17:56
Not that I'm Scrooge but I've given it an upvote for now because I think mrip deserves it as well. I'm waiting for a more deserving answer which involves minimal/no manual extra work as compared to what I was doing, but if none appears then the checkmark is all @mrip's. – TheComeOnMan Oct 02 '13 at 18:31

score 6 · Answer 2 · answered Oct 02 '13 at 17:32

Try the following:

colnames(
  mergeWithSuffix(df1,df2, by = 'a', suffixes = c("1","2"))
)
[1] "a"   "b.1" "d.1" "d.2"

Notice that the original data.frames are unharmed.

colnames(df1)
[1] "a" "b" "d"

colnames(df2)
[1] "a" "d"

The functions are as follows

require(data.table)

mergeWithSuffix <- function(x, y, by, suffixes=NULL, ...) {

  # Add Suffixes
  mkSuffix(x, suffixes[[1]], merge.col=by)
  mkSuffix(y, suffixes[[2]], merge.col=by)

  # Merge
  ret <- merge(x, y, by = by, suffixes = NULL, ...)

  # Remove Suffixes
  undoSuffix(x, suffixes[[1]], merge.col=by)
  undoSuffix(y, suffixes[[2]], merge.col=by)
  return(ret)
}

mkSuffix <- function(x, sfx, sep=".", merge.col=NULL)  {
  nms <- setdiff(names(x), merge.col)
  setnames(x, nms, paste(nms, sfx, sep=".") ) 
}

undoSuffix <- function(x, sfx, sep=".", merge.col=NULL) {
  nms <- setdiff(names(x), merge.col)
  setnames(x, nms, sub(paste0(get("sep"), sfx, "$"), "", nms))
}

Notice that setnames works by reference, so the overhead is almost negligible. Also, as discussed elsewhere, this works equally well on data.frames and data.table

This answer can be generalized to merge multiple data frames: https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list — user3226167, Jul 14 '17 at 07:12

score 1 · Answer 3 · answered Oct 02 '13 at 17:29

This is an interesting question, and I doubt that extending merge would be a straightforward solution unless Matt Dowle and Co. think it's something worth implementing in merge.data.table.

Here's one approach that came to mind:

DTs <- c("df1", "df2")
suffixes <- seq_along(DTs)

for (i in seq_along(DTs)) {
  Name <- setdiff(colnames(get(DTs[i])), "a")
  setnames(get(DTs[i]), Name, paste(Name, suffixes[i], sep = "."))
}

merge(df1, df2, by = "a") # Will obviously work as you expect now

(I'm too tired to think about whether there are any performance impacts of using `get` the way it was done here....) — A5C1D2H2I1M1N2O1R2T1, Oct 02 '13 at 17:31

Extending Suffixes in Merge to All Non-by Columns

3 Answers3

Linked

Related