21

How do I merge 2 similar data frames but have one with greater importance?

For example:

Dataframe 1

Date      Col1    Col2
jan         2      1
feb         4      2
march       6      3
april       8      NA

Dataframe 2

Date      Col2    Col3
jan         9      10
feb         8      20
march       7      30
april       6      40

merge these by Date with dataframe 1 taking precedence but dataframe 2 filling blanks

DataframeMerge

Date      Col1    Col2    Col3
jan         2       1      10
feb         4       2      20
march       6       3      30
april       8       6      40

EDIT - SOLUTION

commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
commonNames <- commonNames[commonNames != "key"]
dfmerge<- merge(df1,df2,by="key",all=T)
for(i in commonNames){
  left <- paste(i, ".x", sep="")
  right <- paste(i, ".y", sep="")
  dfmerge[is.na(dfmerge[left]),left] <- dfmerge[is.na(dfmerge[left]),right]
  dfmerge[right]<- NULL
  colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
EvilWeebl
  • 689
  • 1
  • 8
  • 16

4 Answers4

14
merdat <- merge(dfrm1,dfrm2, by="Date")  # seems self-documenting

#  explanation for next line in text below.
merdat$Col2.y[ is.na(merdat$Col2.y) ] <- merdat$Col2.x[ is.na(merdat$Col2.y) ]

Then just rename 'merdat$Col2.y' to 'merdat$Col2' and drop 'merdat$Col2.x'.

In reply to request for more comments: One way to update only sections of a vector is to construct a logical vector for indexing and apply it using "[" to both sides of an assignment. Another way is to devise a logical vector that is only on the LHS of an assignment but then make a vector using rep() that has the same length as sum(logical.vector). The goal is both instances is to have the same length (and order) for assignment as the items being replaced.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • nice answer, but a few more comments in the code will make it very useful. – Sam Apr 16 '13 at 17:02
  • Appreciate the answer but I' sorry to say I forgot to mention one vital piece of info, other than the date(key) column I wont know the other columns in the tables. Some may match some may not. I suppose I could just use a match between names(dfrm1) and names(dfrm2) to get the columns I need to apply your code to? (appending the x and y to col names of course) – EvilWeebl Apr 17 '13 at 09:04
  • Thanks for the help, I took your answer and expanded on it to cover when the similar columns are unknown. I've added the solution to my question to help anyone in the future. – EvilWeebl Apr 17 '13 at 15:32
  • Thanks. You might consider making the determination of "common names" more compact using the `intersect` function. `ComNams <- intersect(names(df1), names(df2))` – IRTFM Apr 17 '13 at 17:02
11

Update using v1.9.6 of data.table's on= argument (which allows for adhoc joins:

setDT(df1)[df2, `:=`(Col2 = ifelse(is.na(Col2), i.Col2, Col2), 
                     Col3 = i.Col3), on="Date"][]

Here's a data.table solution. Make sure your df1 and df2's Date column is factor with desired levels (for ordering)

require(data.table)
dt1 <- data.table(df1, key="Date")
dt2 <- data.table(df2, key="Date")
# Col2 refers to the Col2 of dt1 and i.col2 refers to that of dt2
dt1[dt2, `:=`(Col3 = Col3, Col1 = Col1, 
        Col2 = ifelse(is.na(Col2), i.Col2, Col2))]

# the result is stored in dt1
> dt1
#     Date Col1 Col2 Col3
# 1:   jan    2    1   10
# 2:   feb    4    2   20
# 3: march    6    3   30
# 4: april    8    6   40
Arun
  • 116,683
  • 26
  • 284
  • 387
  • data.table has a function for pulling non-null values in multiple columns, see: https://rdrr.io/cran/data.table/man/coalesce.html – Feng Jiang Feb 02 '21 at 20:48
9

Here is a dplyr solution. Credit to @docendo discimus

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4)) 

  y x1
1 A  1
2 B  2
3 C NA
4 D  4

df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))

  y x1
1 A  5
2 B  6
3 C  7

dplyr

left_join(df1, df2, by="y") %>% 
transmute(y, x1 = ifelse(is.na(x1.y), x1.x, x1.y))

  y x1
1 A  5
2 B  6
3 C  7
Vedda
  • 7,066
  • 6
  • 42
  • 77
5

Consider this example:

> d1 <- data.frame(x=1:4, a=2:5, b=c(3,4,5,NA))
> d1
  x a  b
1 1 2  3
2 2 3  4
3 3 4  5
4 4 5 NA
> d2 <- data.frame(x=1:4, b=c(6,7,8,9), c=11:14)
> d2
  x b  c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14

Now use merge and within, with ifelse:

> within(merge(d1, d2, by="x"), {b <- ifelse(is.na(b.x),b.y,b.x); b.x <- NULL; b.y <- NULL})
  x a  c b
1 1 2 11 3
2 2 3 12 4
3 3 4 13 5
4 4 5 14 9
Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69
  • 2
    This is cool, but it only works if you already know which column names are going to be available in both dataframes, is not the case in OP's question (see clarifying commit on [42-'s answer](https://stackoverflow.com/a/16042494/210945)). – naught101 Jul 20 '17 at 01:49