2

I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.

Input

> d1
  id  x  y
1  1  1 NA
2  2 NA  3
3  3  4 NA
> d2
  id  x  y z
1  4 NA 30 a
2  3 20  2 b
3  2 14 NA c
4  1 15 97 d

(note that the actual data.frames have hundreds of columns)

Expected output:

> d1
  id  x  y z
1  1  1 97 d
2  2 14  3 c
3  3  4  2 b

Data and current solution:

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

for (col in setdiff(names(d1), "id")) {
  # If missing look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
} 
for (col in setdiff(names(d2), names(d1))) {
  # If column missing then add
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}

PS:

Likely this questions has been asked before but I'm lacking in vocabulary to search it.

s_baldur
  • 29,441
  • 4
  • 36
  • 69

3 Answers3

2

Here is a possibility using dplyr::left_join:

left_join(d1, d2, by = "id") %>%
    mutate(
        x = ifelse(!is.na(x.x), x.x, x.y),
        y = ifelse(!is.na(y.x), y.x, y.y)) %>%
    select(id, x, y, z)
#  id  x  y z
#1  1  1 97 d
#2  2 14  3 c
#3  3  4  2 b
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • I failed to mention this in my op but the actual data.frames have hundreds of columns, so I would need to add more automation to this. – s_baldur May 31 '18 at 08:50
2

Assuming you are working with 2 data.frames, here is a base solution

#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#make sure that d2 also have same number of columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]

#replace NAs with those mask
replace(d, is.na(d), mask[is.na(d)])

If you dont mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows) which seems like it has not been asked before.


edit:

Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R

coalesce.mat <- function(...) {
    ans <- ..1  
    for (elt in list(...)[-1]) {
        rn <- match(ans$id, elt$id)
        ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
    }
    ans         
}

allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat, 
    lapply(list(d1, d2), function(x) {
        x[, setdiff(allcols, names(x))] <- NA
        x 
    }))

edit:

a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

setDT(d1)
setDT(d2)

#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)

#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]

#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
    d1, .SD, on=.(id)]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Maybe best you write it from scratch and just mention you were inspired by this question? – s_baldur May 31 '18 at 08:58
  • `d2[match(d1$id, d2$id), names(d)]` fails when d1 contains columns that are not missing, so maybe amend to: `intersect(names(d1), names(d2))` – s_baldur May 31 '18 at 09:05
1

We can use data.table with coalesce from dplyr. Create a vector of column names that are common ('nm1') and difference ('nm2') in both datasets. Convert the first dataset to 'data.table' (setDT(d1)), join on the 'id' column, assign (:=) the coalesced columns of the first and second (with prefix i. - if there are common columns) to update the values in the first dataset

library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1), 
              mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
#   id  x  y z
#1:  1  1 97 d
#2:  2 14  3 c
#3:  3  4  2 b
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Get an error on the actual data (not the example data though) `Error: Argument 2 must be type character, not double` – s_baldur May 31 '18 at 10:40
  • 1
    @snoram Thanks for notifying me. We are assigning the values to the original dataset. Please check if you have any mismatch in type. If that is the case, change those mismatches first and then do the update – akrun May 31 '18 at 14:25