
This is an extension of this earlier question. How can I combine two columns of a data frame such as

data <- data.frame('a' = c('A','B','C','D','E'),
                   'x' = c("t",2,NA,NA,NA),
                   'y' = c(NA,NA,NA,4,"r"))

displayed as

'a' 'x' 'y'  
 A   t   NA  
 B   2   NA  
 C  NA   NA  
 D  NA   4  
 E  NA   r

to get

 'a' 'mycol'  
  A   t  
  B   2  
  C   NA  
  D   4  
  E   r  

I tried this

cbind(data[1], mycol = na.omit(unlist(data[-1])))

But it obviously doesn't keep the NA row.

Rotail
  • Suggested dupe: [How to implement coalesce in R?](https://stackoverflow.com/q/19253820/903061) – Gregor Thomas Aug 22 '17 at 18:37
  • This case is simple enough you could do `pmin` or `pmax`, e.g., `data$mycol = pmin(data$x, data$y, na.rm = T)`. This scales up to more columns, but it will pick the minimum in the case there are multiple non-missing values. The coalesce answers at the suggested dupe will pick the first non-missing value. – Gregor Thomas Aug 22 '17 at 18:39
  • Another related Q for a simple case like this one, with only one non-`NA` column: [Combine two fitting string columns to one in R](https://stackoverflow.com/questions/27850344/combine-two-fitting-string-columns-to-one-in-r) – Henrik Aug 22 '17 at 18:46
  • 1
    The coalesce solution linked by @Henrik looks like a good general case. You could also use `data.table` to accomplish something similar, but you lose the NAs in favor of 0s. `setDT(data); data[, new := sum(.SD, na.rm=T), by=a, .SDcols= c("x","y"]` – Mako212 Aug 22 '17 at 18:51
  • I added an answer based on this. Thanks all for your comments. – Rotail Aug 22 '17 at 18:58
  • @Mako212 if you want sums you don't need data.table for grouping, just use `rowSums`. The only difference to `pmin` or `pmax` is what happens if there are multiple non-missing values (and the all-NA to 0 as you point out). – Gregor Thomas Aug 22 '17 at 19:00
  • 1
    Following OP's edit with two string columns, it is now even more similar to the Q&A I linked to... – Henrik Aug 22 '17 at 19:07
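
To illustrate the pmin / pmax suggestion from the comments above, here is a minimal sketch (not from the thread; num is an illustrative data frame with numeric columns, matching the outputs shown in the answers below):

num <- data.frame(a = c('A','B','C','D','E'),
                  x = c(1, 2, NA, NA, NA),
                  y = c(NA, NA, NA, 4, 5))

# Parallel minimum of x and y, ignoring NAs; an all-NA row stays NA.
# If both columns held values, pmin would pick the smaller one (the caveat
# noted by Gregor Thomas), whereas coalesce takes the first non-missing value.
num$mycol <- pmin(num$x, num$y, na.rm = TRUE)
num
#   a  x  y mycol
# 1 A  1 NA     1
# 2 B  2 NA     2
# 3 C NA NA    NA
# 4 D NA  4     4
# 5 E NA  5     5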

4 Answers


You could do it by using ifelse, like this:

data$mycol <- ifelse(!is.na(data$x), data$x, data$y)

> data

##   a  x  y mycol
## 1 A  1 NA     1
## 2 B  2 NA     2
## 3 C NA NA    NA
## 4 D NA  4     4
## 5 E NA  5     5
Oriol Mirosa

Going with your logic, you can do the following:

cbind(data[1], mycol = unlist(apply(data[2:3], 1, function(i) ifelse(
  sum(is.na(i)) == length(i),
  NA,
  na.omit(i)[1])
)))

#  a mycol
#1 A     1
#2 B     2
#3 C    NA
#4 D     4
#5 E     5
Patrik_P

This has been addressed here indirectly. Here is a simple solution based on that, using coalesce() from the dplyr package:

library(dplyr)

data$mycol <- coalesce(data$x, data$y)
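
For reference (not part of the original answer): assuming x and y come in as plain character columns (e.g. with stringsAsFactors = FALSE), this should give

data$mycol
# [1] "t" "2" NA  "4" "r"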
Rotail

Extending the answer to any number of columns, and using the neat max.col() function I've discovered thanks to this question:

coalesce <- function(value_matrix) {
  value_matrix <- as.matrix(value_matrix)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}

data$mycol <- coalesce(data[, c('x', 'y')])
data
#   a  x  y mycol
# 1 A  1 NA     1
# 2 B  2 NA     2
# 3 C NA NA    NA
# 4 D NA  4     4
# 5 E NA  5     5

max.col(..., ties.method = "first") returns, for each row, the index of the first column holding that row's maximum value. Since we're using it on a logical matrix, the maximum is TRUE whenever the row contains at least one non-NA value, so we get the position of the first non-NA value in each row. If the entire row is NA, every entry of !is.na(...) is FALSE, the tie is broken in favour of column 1, and the value found there is NA, so we still get an NA as desired.

After that, the function subsets the value matrix with a two-column matrix of (row, column) indices, which extracts exactly one value per row.
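
A small standalone illustration of both tricks (not from the original answer; m is just a toy matrix):

m <- matrix(c( 1, NA, NA,
              NA, NA,  7,
              NA, NA, NA),
            nrow = 3, byrow = TRUE)

first_non_missing <- max.col(!is.na(m), ties.method = "first")
first_non_missing
# [1] 1 3 1    <- the all-NA third row still reports column 1

# Matrix indexing with (row, column) pairs pulls out one value per row.
m[cbind(row = 1:3, col = first_non_missing)]
# [1]  1  7 NA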

Edit

In comparison to mrip's coalesce (the Reduce-based coalesce_reduce below), my max.col version is slower when there are a few long columns, but faster when there are many short columns.

coalesce_reduce <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}

coalesce_maxcol <- function(...) {
  value_matrix <- cbind(...)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}

library(microbenchmark)

set.seed(100)

wide <- replicate(
  1000,
  {sample(c(NA, 1:10), 10, replace = TRUE)},
  simplify = FALSE
)

long <- replicate(
  10,
  {sample(c(NA, 1:10), 1000, replace = TRUE)},
  simplify = FALSE
)

microbenchmark(
  do.call(coalesce_reduce, wide),
  do.call(coalesce_maxcol, wide),
  do.call(coalesce_reduce, long),
  do.call(coalesce_maxcol, long)
)
# Unit: microseconds
#                           expr      min        lq       mean   median       uq      max neval
# do.call(coalesce_reduce, wide) 1879.460 1953.5695 2136.09954 2007.303 2152.654 5284.583   100
# do.call(coalesce_maxcol, wide)  403.604  423.5280  490.40797  433.641  456.583 2543.580   100
# do.call(coalesce_reduce, long)   36.829   41.5085   45.75875   43.471   46.942   79.393   100
# do.call(coalesce_maxcol, long)   80.903   88.1475  175.79337   92.374  101.581 3438.329   100
Nathan Werth