14

suffixes in merge works only on common column names. Is there anyway to extend this to the rest of the columns as well without manually updating columns before the merge?

That is -

df1 <- data.table(
a = c(1,2,3,4,5,6),
b = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

df2 <- data.table(
a = c(1,2,3,4,5,6),
d = c('q','l','o','n','q','z')
)

colnames(merge(df1,df2, by = 'a', suffixes = c("1","2")))
#[1] "a"  "b"  "d1" "d2" what it does
#[1] "a"  "b1" "d1" "d2" what I'd like it to do

The current way I'm handling this resembles @mrip's answer.

df1 <- data.table(
a = c(1,2,3,4,5,6),
b = c('a','b','f','e','r','h'),
r = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

df2 <- data.table(
a = c(1,2,3,4,5,6),
c = c('a','b','f','e','r','h'),
q = c('a','b','f','e','r','h'),
d = c('q','l','o','n','q','z')
)

dfmerge <- (merge(df1,df2, by = c("a"), suffixes = c("1","2")))

setnames(
dfmerge,
setdiff(names(df1),names(df2)),
paste0(setdiff(names(df1),names(df2)),"1")
)

setnames(
dfmerge,
setdiff(names(df2),names(df1)),
paste0(setdiff(names(df2),names(df1)),"2")
)

colnames(dfmerge)
#[1] "a"  "b1" "r1" "d1" "c2" "q2" "d2"
TheComeOnMan
  • 12,535
  • 8
  • 39
  • 54
  • Presumably you mean `a1` in that last example, not just `a`? – joran Oct 02 '13 at 16:46
  • It's inconvenient, but with the efficiency of `setnames`, I wonder if renaming before you merge will end up being the most straightforward solution. – A5C1D2H2I1M1N2O1R2T1 Oct 02 '13 at 16:46
  • @joran, I'm guessing they don't since that is the "by" column. – A5C1D2H2I1M1N2O1R2T1 Oct 02 '13 at 16:47
  • @AnandaMahto Gotcha. I agree that there won't be a way to do this (with `merge`) that doesn't involved renaming the columns yourself at some point. – joran Oct 02 '13 at 16:48
  • @MattDowle, I used `data.table` out of habit but I will accept an answer even if it isn't based on data.table. However, data.table compatible answer preferred. – TheComeOnMan Oct 02 '13 at 17:00
  • The point of the `suffixes` argument (in either `data.frame` or `data.table`) is to give the user the ability to distinguish columns that have the same name (which otherwise would be hard after a merge). As such I don't understand why you'd want `merge` to do this. If you don't like the names you have, just change them yourself. – eddi Oct 02 '13 at 17:30
  • @eddi, perhaps they want to indicate which `df` the column originally came from – Ricardo Saporta Oct 02 '13 at 17:36
  • @RicardoSaporta, you're right. – TheComeOnMan Oct 02 '13 at 17:40

3 Answers3

12

A simple solution:

mrg<-(merge(df1,df2, by = 'a', suffixes = c("1","2")))
setnames(mrg,paste0(names(mrg),ifelse(names(mrg) %in% setdiff(names(df1),names(df2)),"1","")))
setnames(mrg,paste0(names(mrg),ifelse(names(mrg) %in% setdiff(names(df2),names(df1)),"2","")))

> names(mrg)
[1] "a"  "b1" "d1" "d2"

Edit: thanks to comments by Ricardo Saporta for cleaning this up considerably and teaching me a few new tips!

Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
mrip
  • 14,913
  • 4
  • 40
  • 58
  • @RicardoSaporta Yes! Thanks, I was confused about the. Cleaned it up. – mrip Oct 02 '13 at 17:41
  • @RicardoSaporta Yes again! Thanks, that will save me a lot of `,sep="")` in the future. Great tip~ – mrip Oct 02 '13 at 17:43
  • @eddi, why? the `by` values are the ones that do NOT require appending a suffix – Ricardo Saporta Oct 02 '13 at 17:44
  • @mrip, thanks. You and I are thinking along the same lines. I added the code I'm currently using to the question if you'd like to have a look at it. – TheComeOnMan Oct 02 '13 at 17:56
  • Not that I'm Scrooge but I've given it an upvote for now because I think mrip deserves it as well. I'm waiting for a more deserving answer which involves minimal/no manual extra work as compared to what I was doing, but if none appears then the checkmark is all @mrip's. – TheComeOnMan Oct 02 '13 at 18:31
6

Try the following:

colnames(
  mergeWithSuffix(df1,df2, by = 'a', suffixes = c("1","2"))
)
[1] "a"   "b.1" "d.1" "d.2"

Notice that the original data.frames are unharmed.

colnames(df1)
[1] "a" "b" "d"

colnames(df2)
[1] "a" "d"

The functions are as follows

require(data.table)

mergeWithSuffix <- function(x, y, by, suffixes=NULL, ...) {

  # Add Suffixes
  mkSuffix(x, suffixes[[1]], merge.col=by)
  mkSuffix(y, suffixes[[2]], merge.col=by)

  # Merge
  ret <- merge(x, y, by = by, suffixes = NULL, ...)

  # Remove Suffixes
  undoSuffix(x, suffixes[[1]], merge.col=by)
  undoSuffix(y, suffixes[[2]], merge.col=by)
  return(ret)
}

mkSuffix <- function(x, sfx, sep=".", merge.col=NULL)  {
  nms <- setdiff(names(x), merge.col)
  setnames(x, nms, paste(nms, sfx, sep=".") ) 
}

undoSuffix <- function(x, sfx, sep=".", merge.col=NULL) {
  nms <- setdiff(names(x), merge.col)
  setnames(x, nms, sub(paste0(get("sep"), sfx, "$"), "", nms))
}

Notice that setnames works by reference, so the overhead is almost negligible. Also, as discussed elsewhere, this works equally well on data.frames and data.table

Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • This answer can be generalized to merge multiple data frames: https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list – user3226167 Jul 14 '17 at 07:12
1

This is an interesting question, and I doubt that extending merge would be a straightforward solution unless Matt Dowle and Co. think it's something worth implementing in merge.data.table.

Here's one approach that came to mind:

DTs <- c("df1", "df2")
suffixes <- seq_along(DTs)

for (i in seq_along(DTs)) {
  Name <- setdiff(colnames(get(DTs[i])), "a")
  setnames(get(DTs[i]), Name, paste(Name, suffixes[i], sep = "."))
}

merge(df1, df2, by = "a") # Will obviously work as you expect now
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485