Merging heterogeneous data.frames

Question

I'm trying to merge two data.frames in R:

d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),Height=c(178,169,192),Grade=c(15,12,18))
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"))

I'd like to merge by Id, retaining only the Id, Name, Age, Sex and Grade columns in the final data.frame.

I've come up with some lengthy code that does the job, but is there a better way?

dm <- data.frame(Id=unique(c(d1$Id,d2$Id)))
dm.d1.rows <- sapply(dm$Id, match, table = d1$Id)
dm.d2.rows <- sapply(dm$Id, match, table = d2$Id)
for(i in c("Name", "Age","Sex","Grade")) {
    if(i %in% colnames(d1) && is.factor(d1[[i]]) || i %in% colnames(d2) && is.factor(d2[[i]])) dm[[i]]<- factor(rep(NA,nrow(dm)),
            levels=unique(c(levels(d1[[i]]),levels(d2[[i]]))))
    else dm[[i]]<- rep(NA,nrow(dm))
    if(i %in% colnames(d1)) dm[[i]][!is.na(dm.d1.rows)] <- d1[[i]][na.exclude(dm.d1.rows)]
    if(i %in% colnames(d2)) dm[[i]][!is.na(dm.d2.rows)] <- d2[[i]][na.exclude(dm.d2.rows)]
}

`dm2 <- merge(d1, d2, by=c("Id", "Name"), all=TRUE); dm2$Age <- with(dm2, ifelse(is.na(Age.x), Age.y, Age.x)); dm2[c("Id", "Name", "Age", "Sex", "Grade")]` — jogo, Sep 24 '18 at 11:20
relevant (not duplicate) : https://stackoverflow.com/questions/27167151/merge-combine-columns-with-same-name-but-incomplete-data/51386513#51386513 — moodymudskipper, Sep 27 '18 at 07:36

Sotos · Accepted Answer · 2018-09-27T07:51:43.993

Here is an idea via the tidyverse, using the function coalesce. This function replaces the NA values in one column with the values from another (specified) column.

Official Documentation for coalesce: Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.
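
As a minimal illustration of that behaviour, the sketch below calls dplyr::coalesce() on two made-up vectors. The names x, y and filled are hypothetical and not part of the question's data.

```r
library(dplyr)

# Two hypothetical vectors with missing values in different positions
x <- c(1, NA, 3)
y <- c(10, 20, NA)

# At each position, take the first non-missing value, preferring x over y
filled <- coalesce(x, y)
filled
# [1]  1 20  3
```

The argument order sets the priority: swapping x and y would prefer the values of y wherever both are present.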

library(tidyverse)

d1 %>% 
 full_join(d2, by = c('Id', 'Name')) %>% 
 mutate(Age = coalesce(Age.x, Age.y)) %>% 
 select(Id, Name, Age, Sex, Grade)

which gives,

  Id  Name Age  Sex Grade
1  1  Yann  21    M    15
2  2  Anne  19 <NA>    12
3  3 Sabri  31    M    18
4  4   Jui  15    F    NA

Similarly, in data.table syntax,

library(data.table)

#Convert to data.tables (setDT() converts by reference, so d1 and d2 themselves become data.tables)
d1_t <- setDT(d1)
d2_t <- setDT(d2)

merge(d1_t, d2_t, by = c('Id', 'Name'), all = TRUE)[,
            Age := ifelse(is.na(Age.x), Age.y, Age.x)][, 
              c('Age.x', 'Age.y', 'City', 'Height') := NULL][]

which gives,

   Id  Name Grade  Sex Age
1:  1  Yann    15    M  21
2:  2  Anne    12 <NA>  19
3:  3 Sabri    18    M  31
4:  4   Jui    NA    F  15
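
As a possible variation on the data.table approach, the sketch below replaces the ifelse() call with data.table's own fcoalesce(), assuming a data.table version recent enough to provide that function. It works on copies made with as.data.table(), so the original data frames are left untouched. The names dt1, dt2 and res are illustrative only.

```r
library(data.table)

# Same data as in the question
d1 <- data.frame(Id = 1:3, Name = c("Yann", "Anne", "Sabri"),
                 Age = c(21, 19, 31), Height = c(178, 169, 192),
                 Grade = c(15, 12, 18), stringsAsFactors = FALSE)
d2 <- data.frame(Id = c(1, 3, 4), Name = c("Yann", "Sabri", "Jui"),
                 Age = c(28, 21, 15), Sex = c("M", "M", "F"),
                 City = c("Paris", "Paris", "Toulouse"),
                 stringsAsFactors = FALSE)

# as.data.table() returns copies, unlike setDT() which converts in place
dt1 <- as.data.table(d1)
dt2 <- as.data.table(d2)

# Full outer join on Id and Name; the clashing Age columns become Age.x / Age.y
res <- merge(dt1, dt2, by = c("Id", "Name"), all = TRUE)

# Prefer the Age from d1, falling back to the Age from d2 when it is missing
res[, Age := fcoalesce(Age.x, Age.y)]

# Keep only the requested columns, in the requested order
res <- res[, .(Id, Name, Age, Sex, Grade)]
res
```

Selecting the wanted columns by name, rather than deleting the unwanted ones, keeps the code valid if further columns are added to either input.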

abcalphabet · Answer 2 · 2018-09-27T07:20:42.350

Personally I'm a big fan of sqldf, which allows you to use SQL queries to create and manipulate data frames. In your case the statement below should do the trick.

d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),
    Height=c(178,169,192),Grade=c(15,12,18))
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),
    Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"))

library(sqldf)

d3 <- sqldf("SELECT d1.Id, d1.Name, d1.Age, d2.Sex , d1.Grade
            FROM d1
            LEFT JOIN d2 ON d1.Id = d2.Id
            UNION
            SELECT d2.Id, d2.Name, coalesce(d1.Age, d2.Age) , d2.Sex, coalesce(d1.Grade, NULL)
            FROM d2 
            LEFT JOIN d1 ON d2.Id = d1.Id")

sqldf/SQL can be especially useful for more complicated data frame merging and manipulation.

EDIT: Used working sqldf / R environment to fix SQL statement, resulting in the table below:

Id  Name Age  Sex Grade
1  Yann  21    M    15
2  Anne  19 <NA>    12
3 Sabri  31    M    18
4   Jui  15    F    NA

moodymudskipper · Answer 3 · 2018-09-26T22:45:38.793

In base R:

d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),Height=c(178,169,192),Grade=c(15,12,18),stringsAsFactors = FALSE)
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"),stringsAsFactors = FALSE)
nms <- c("Id","Name", "Age", "Sex", "Grade")

. <- merge(d2,d1,all=TRUE,sort=FALSE)[nms]
aggregate(.,list(.$Id), function(x) c(na.omit(x),NA)[1])[-1]
#   Id  Name Age  Sex Grade
# 1  1  Yann  28    M    15
# 2  2  Anne  19 <NA>    12
# 3  3 Sabri  21    M    18
# 4  4   Jui  15    F    NA

Note the stringsAsFactors = FALSE; if your columns are factors, you'll need to convert them to characters before applying this solution.
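
If the data frames were created with factor columns, one way to do that conversion in base R is sketched below. The data frame d and the helper vector is_fac are hypothetical names used only for illustration.

```r
# Hypothetical data frame with one factor column
d <- data.frame(Id = 1:2, Name = factor(c("Yann", "Anne")))

# Flag the factor columns, then convert only those to character
is_fac <- vapply(d, is.factor, logical(1))
d[is_fac] <- lapply(d[is_fac], as.character)

str(d)
```

Non-factor columns such as Id are left as they are, so the same two lines can be applied to d1 and d2 before the merge.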

invert `d1` and `d2` in first line to have same output as other answers, but this reproduces `dm` from OP. — moodymudskipper, Sep 26 '18 at 22:44

score 1 · Answer 4 · answered Sep 27 '18 at 03:24

This might not be an ideal answer, but here is a non-merge, non-join option using sapply, since we want to combine the two data frames using only one column.

#Name the cols which you want in the final data frame
cols <- c("Id", "Name", "Age", "Sex","Grade")
#Get all unique id's 
ids <- union(d1$Id, d2$Id)

#Loop over each ID
data.frame(t(sapply(ids, function(x) {
   #Get indices in d1 where Id is present
   d1inds <- d1$Id == x
   #Get indices in d2 where Id is present
   d2inds <- d2$Id == x

   #If the Id is present in both d1 AND d2
   if (any(d1inds) && any(d2inds))

     #Combine d2 and d1 and select only cols column
     #This is based on your expected output that in case if the ID is same 
     #we want to prefer Name and Age column from d2 rather than d1 
     return(cbind(d2[d2inds, ], d1[d1inds, ])[cols])
     #If you want to prefer d1 over d2, we can do
     #return(cbind(d1[d1inds, ], d2[d2inds, ])[cols])

   #If the Id is present only in d1, add a "Sex" column with NA
   if (any(d1inds))
      return(cbind(d1[d1inds, ], "Sex" = NA)[cols])

   #If the Id is present only in d2, add a "Grade" column with NA
   else     
      return(cbind(d2[d2inds, ], "Grade" = NA)[cols])
})))

#  Id  Name Age Sex Grade
#1  1  Yann  28   M    15
#2  2  Anne  19  NA    12
#3  3 Sabri  21   M    18
#4  4   Jui  15   F    NA

data

d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),
    Height=c(178,169,192),Grade=c(15,12,18), stringsAsFactors = FALSE)
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),
   Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"), stringsAsFactors = FALSE)

moodymudskipper · Answer 5 · 2019-03-02T22:47:36.173

You could use my package safejoin to make a full join and resolve the conflicting columns with dplyr::coalesce. We also use dplyr::one_of so we don't have to pick the columns from each side manually.

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)  # for select(), one_of() and coalesce()

keep <- c("Id", "Name", "Age", "Sex", "Grade")
safe_full_join(select(d1,one_of(keep)), select(d2,one_of(keep)),  
  by = c("Id","Name"), conflict = coalesce, check="")
#   Id  Name Age Grade  Sex
# 1  1  Yann  21    15    M
# 2  2  Anne  19    12 <NA>
# 3  3 Sabri  31    18    M
# 4  4   Jui  15    NA    F
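
For readers who prefer not to install an extra package, the same idea of selecting columns from a keep vector can be sketched with dplyr alone. This is not the safejoin API: it swaps in dplyr's full_join() and coalesce(), and uses any_of(), which supersedes one_of() in current tidyselect/dplyr releases. The name res is illustrative.

```r
library(dplyr)

# Same data as in the question
d1 <- data.frame(Id = 1:3, Name = c("Yann", "Anne", "Sabri"),
                 Age = c(21, 19, 31), Height = c(178, 169, 192),
                 Grade = c(15, 12, 18), stringsAsFactors = FALSE)
d2 <- data.frame(Id = c(1, 3, 4), Name = c("Yann", "Sabri", "Jui"),
                 Age = c(28, 21, 15), Sex = c("M", "M", "F"),
                 City = c("Paris", "Paris", "Toulouse"),
                 stringsAsFactors = FALSE)

keep <- c("Id", "Name", "Age", "Sex", "Grade")

# any_of() silently skips names absent from a table (Sex in d1, Grade in d2)
res <- full_join(select(d1, any_of(keep)),
                 select(d2, any_of(keep)),
                 by = c("Id", "Name")) %>%
  # Resolve the Age conflict by preferring d1 and falling back to d2
  mutate(Age = coalesce(Age.x, Age.y)) %>%
  # all_of() errors if a requested column is missing from the result
  select(all_of(keep))
res
```

Unlike the safejoin call, the conflicting column has to be named explicitly here, so this sketch scales less well when many columns overlap.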
