Update existing data.frame with values from another one if missing

Question

I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.

Input

> d1
  id  x  y
1  1  1 NA
2  2 NA  3
3  3  4 NA
> d2
  id  x  y z
1  4 NA 30 a
2  3 20  2 b
3  2 14 NA c
4  1 15 97 d

(note that the actual data.frames have hundreds of columns)

Expected output:

> d1
  id  x  y z
1  1  1 97 d
2  2 14  3 c
3  3  4  2 b

Data and current solution:

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

for (col in setdiff(names(d1), "id")) {
  # If missing look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
} 
for (col in setdiff(names(d2), names(d1))) {
  # If column missing then add
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}

PS:

Likely this questions has been asked before but I'm lacking in vocabulary to search it.

maybe the word u are looking for is coalesce? – chinsoon12 May 31 '18 at 08:40 — chinsoon12, May 31 '18 at 08:40

score 2 · Answer 1 · answered May 31 '18 at 08:34

2

Here is a possibility using dplyr::left_join:

left_join(d1, d2, by = "id") %>%
    mutate(
        x = ifelse(!is.na(x.x), x.x, x.y),
        y = ifelse(!is.na(y.x), y.x, y.y)) %>%
    select(id, x, y, z)
#  id  x  y z
#1  1  1 97 d
#2  2 14  3 c
#3  3  4  2 b

answered May 31 '18 at 08:34

Maurits Evers

49,617
4
47
68

I failed to mention this in my op but the actual data.frames have hundreds of columns, so I would need to add more automation to this. – s_baldur May 31 '18 at 08:50

chinsoon12 · Accepted Answer · 2018-06-01T01:44:54.367

Assuming you are working with 2 data.frames, here is a base solution

#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#make sure that d2 also have same number of columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]

#replace NAs with those mask
replace(d, is.na(d), mask[is.na(d)])

If you dont mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows) which seems like it has not been asked before.

edit:

Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R

coalesce.mat <- function(...) {
    ans <- ..1  
    for (elt in list(...)[-1]) {
        rn <- match(ans$id, elt$id)
        ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
    }
    ans         
}

allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat, 
    lapply(list(d1, d2), function(x) {
        x[, setdiff(allcols, names(x))] <- NA
        x 
    }))

edit:

a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

setDT(d1)
setDT(d2)

#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)

#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]

#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
    d1, .SD, on=.(id)]

Maybe best you write it from scratch and just mention you were inspired by this question? — s_baldur, May 31 '18 at 08:58
`d2[match(d1$id, d2$id), names(d)]` fails when d1 contains columns that are not missing, so maybe amend to: `intersect(names(d1), names(d2))` — s_baldur, May 31 '18 at 09:05

akrun · Answer 3 · 2018-05-31T08:59:48.350

1

We can use data.table with coalesce from dplyr. Create a vector of column names that are common ('nm1') and difference ('nm2') in both datasets. Convert the first dataset to 'data.table' (setDT(d1)), join on the 'id' column, assign (:=) the coalesced columns of the first and second (with prefix i. - if there are common columns) to update the values in the first dataset

library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1), 
              mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
#   id  x  y z
#1:  1  1 97 d
#2:  2 14  3 c
#3:  3  4  2 b

edited May 31 '18 at 08:59

answered May 31 '18 at 08:34

akrun

874,273
37
540
662

Get an error on the actual data (not the example data though) `Error: Argument 2 must be type character, not double` – s_baldur May 31 '18 at 10:40
1

@snoram Thanks for notifying me. We are assigning the values to the original dataset. Please check if you have any mismatch in type. If that is the case, change those mismatches first and then do the update – akrun May 31 '18 at 14:25

Update existing data.frame with values from another one if missing

3 Answers3

Linked