3

Is there a merging function that prioritizes non-missing values from common variables?

Consider the following example.

First we generate two data.frames with the same IDs but complementary missing values on a particular varuiable:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

Applying merge or any of the join functions from the dplyr package produces results similar to the below:

> merge(df1, df2, by = 'ID')
  ID      V1.x       V1.y
1  a        NA -1.5399500
2  b 1.3297993         NA
3  c 0.4146414         NA
4  d        NA -0.9285670
5  e        NA -0.2947204
6  f 1.2724293         NA

We'd like to join these two data.frames in a "smarter" way that ignores missing values in one data.frame when not missing in the other to obtain the below output:

> output <- df1
> output$V1[is.na(df1$V1)]  <- df2$V1[!(is.na(df2$V1))]
> output
  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

We can assume that df1 and df2 have totally complementary missing values of V1.

EDIT

A solution that would work for an arbitrary number of variables would be ideal.

Arun
  • 116,683
  • 26
  • 284
  • 387
Richard Border
  • 3,209
  • 16
  • 30
  • But what if they aren't complementary? If an ID *does* have non-missing values in both df1 and df2 do you want to retain both, or prioritize one? SQL would typically have you prioritize one using the `coalesce` function - [see here for implementations of `coalesce` in R](http://stackoverflow.com/q/19253820/903061). Of course it will still work if they are complementary as well. – Gregor Thomas Jun 08 '16 at 23:39
  • @Gregor The devel version of `dplyr` has an implementation of `coalsece` so you can simply do `dplyr::coalesce(df1, df2)`: – Steven Beaupré Jun 08 '16 at 23:42
  • Correct--you *would* want to retain both in such a situation. That's not my situation, but `coalesce` is sounding right--thank you both. Perhaps a more general (better?) question would ask how to implement a merge that selects one of two values based on some condition, not just missingness... – Richard Border Jun 08 '16 at 23:44
  • @StevenBeaupré I think (hope!) that it is more analogous to SQL and works on vectors not data frames. The idiom I'd hope for would be `full_join(df1, df2, by = 'ID') %>% mutate(V1 = coalesce(V1.x, V1.y)` – Gregor Thomas Jun 08 '16 at 23:48
  • @Buckminster I think it's two operations - a merge/join and a coalesce. Best to treat them as two steps. – Gregor Thomas Jun 08 '16 at 23:49
  • 1
    @Gregor got it. The issue that comes to mind is that there may be many such variables (my current situation) and it would be great to automate the coalescing! – Richard Border Jun 08 '16 at 23:51
  • 1
    That's a good point. In a case like this though when you're not really joining something like `na.omit(rbind(df1, df2))` works as well (equivalently to `merge(na.omit(df1), na.omit(df2), by = 'ID')`, which you more-or-less show). Not sure why that method is unsatisfactory. Are there potentially missing values in other columns you want to keep around? – Gregor Thomas Jun 08 '16 at 23:56
  • @Gregor exactly--there are missing values in other columns that are meaningful. I think my toy example might be a little too simplified to communicate all of the issues at hand. – Richard Border Jun 08 '16 at 23:58
  • @Gregor That would be the idiomatic way to proceed, indeed. – Steven Beaupré Jun 09 '16 at 00:02

4 Answers4

3

Thanks to the very helpful comments of @Gregor and @StevenBeaupré, I came up with a solution using coalesce.na from the kimisc package that extends to arbitrary numbers of variables:

mapply(function(x,y) coalesce.na(x,y), df1$V1, df2$V1)
[1] -1.5399500  1.3297993  0.4146414 -0.9285670 -0.2947204  1.2724293

Notice that df1$V1 and df2$V1 could be replaced lists of variables, allowing for something like:

> set.seed(1)
> missings  <- sample.int(6, 3)
> df1  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df2  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df1$V1[missings]  <- rnorm(3)
> df2$V1[setdiff(1:6, missings)]  <- rnorm(3)
> df1$V2[setdiff(1:6, missings)]  <- rnorm(3)
> df2$V2[missings]  <- rnorm(3)

> cbind(df1, df2)
  ID        V1           V2 ID         V1         V2
1  a        NA -0.005767173  a -1.5399500         NA
2  b 1.3297993           NA  b         NA -0.7990092
3  c 0.4146414           NA  c         NA -0.2894616
4  d        NA  2.404653389  d -0.9285670         NA
5  e        NA  0.763593461  e -0.2947204         NA
6  f 1.2724293           NA  f         NA -1.1476570

> dfMerged <- merge(df1, df2, by = 'ID')
> xList <- dfMerged[grep("\\.x$", names(dfMerged))]
> yList <- dfMerged[grep("\\.y$", names(dfMerged))]

> mapply(function(x,y) coalesce.na(x,y), xList, yList)
           V1.x         V2.x
[1,] -1.5399500 -0.005767173
[2,]  1.3297993 -0.799009249
[3,]  0.4146414 -0.289461574
[4,] -0.9285670  2.404653389
[5,] -0.2947204  0.763593461
[6,]  1.2724293 -1.147657009

A full solution would thus look something like:

library(kimisc)
smartMergeList <- function(dfList, idVar) {
    merged <- Reduce(x = dfList, 
                     f = function(x,y) merge(x, y, by = idVar, all = T))
    xList <- merged[grep("\\.x$", names(merged))]
    yList <- merged[grep("\\.y$", names(merged))]
    merged[names(xList)] <- mapply(function(x,y) coalesce.na(x,y),
                            xList, yList)
    merged[names(yList)] <- NULL
    merged
})

I would love to see something prettier though!

Richard Border
  • 3,209
  • 16
  • 30
2

If avoiding specifying the columns is the only issue with output$V1[is.na(df1$V1)] <- df2$V1[!(is.na(df2$V1))], then you just need to use na.omit() instead of specifying the variables individually. We can also do this only for overlapping columns.

Let's modify the original data so that the columns aren't a perfect overlap and the extra columns have some data we want to keep:

set.seed(1)

missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA, V2 = c(NA, 2, 3, NA, 5, 6))
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

# now df1 looks like this:
df1
#   ID        V1 V2
# 1  a        NA NA
# 2  b 1.3297993  2
# 3  c 0.4146414  3
# 4  d        NA NA
# 5  e        NA  5
# 6  f 1.2724293  6


common_cols = intersect(names(df1), names(df2))
result = na.omit(rbind(df1[common_cols], df2[common_cols]))
result = merge(result, df1, all.x = T)
result = merge(result, df2, all.x = T)
  # the merges are only necessary if there are additional columns to pick up

result
# ID           V1 V2
# 1  a -1.5399500 NA
# 2  b  1.3297993  2
# 3  c  0.4146414  3
# 4  d -0.9285670 NA
# 5  e -0.2947204 NA
# 6  f  1.2724293  6
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Hmmm, I see I still have a problem because the V2 value for ID `e` got lost. Probably the solution is to treat the ID column separately from the other `common_col` columns and merge on `ID` including only non-common columns... – Gregor Thomas Jun 09 '16 at 00:15
1

Based on the discussion and answers above, here is my take using dplyr. Not the cleanest code, yes I do have suppressWarnings().

For the reproducible example by OP:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

Simple solution:

library(dplyr)
library(reshape2)
coalesce <- function(...) {
  apply((...), 1, function(x) {
    x[which(!is.na(suppressWarnings(as.numeric(x))))[1]]
  })
}

full_join(df1, df2, by = 'ID') %>% mutate(V1 = coalesce(.)) %>% select(.,ID,V1)

  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

For a general solution with multiple (here shown with 3) variables:

set.seed(1)
df1  <- data.frame(ID = letters[1:6], V1 = NA, V2 = NA, V3 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA, V2 = NA, V3 = NA)
df1$V1[sample.int(6, 3)]  <- rnorm(3)
df2$V1[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)
df1$V2[sample.int(6, 3)]  <- rnorm(3)
df2$V2[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)
df1$V3[sample.int(6, 3)]  <- rnorm(3)
df2$V3[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)

Same coalesce() function, more elaborate dplyr logic:

library(dplyr)
library(reshape2)
coalesce <- function(...) {
  apply((...), 1, function(x) {
    x[which(!is.na(suppressWarnings(as.numeric(x))))[1]]
  })
}

full_join(df1, df2, by = "ID") %>% 
  melt(., id.vars = "ID") %>%
  mutate(var = substr(as.character(variable),0,2)) %>%
  group_by(var,value) %>% 
  dcast(.,ID + var ~ variable, value.var = "value") %>%
  mutate(c = coalesce(.)) %>%
  dcast(.,ID ~ var, value.var = "c")

  ID         V1         V2          V3
1  a -1.5399500  0.3898432        <NA>
2  b -0.9285670 -0.3053884  0.80418951
3  c -0.8356286       <NA>   0.5939013
4  d  0.1836433 -0.4115108 -0.05710677
5  e       <NA>       <NA>   0.8212212
6  f -0.6264538  1.5117812   0.9189774

The coalesce() function picks the first non-NA value (if it exists). You can choose max or something else per your problem. The ID column or any other columns in the merged dataset must be non-numeric. I hope this is some help toward solving your real problem.

Divi
  • 1,614
  • 13
  • 23
  • I like that your solution generalizes to conditions other than missingness, i.e., replacing `which(!is.na())` with `which.max` would select largest of all values. – Richard Border Jun 09 '16 at 20:08
  • I have not tested it other than the dataset in this question. Let me know if it works for the multiple variables in your problem. – Divi Jun 09 '16 at 20:17
0

Here is a possible data.table way to do it:

library(data.table)
setDT(df1); setDT(df2);

df1[df2, V1 := ifelse(is.na(V1), i.V1, V1), on = "ID"]

df1
#    ID         V1
# 1:  a -1.5399500
# 2:  b  1.3297993
# 3:  c  0.4146414
# 4:  d -0.9285670
# 5:  e -0.2947204
# 6:  f  1.2724293
Psidom
  • 209,562
  • 33
  • 339
  • 356