31

I know this is a duplicate Q but I can't seem to find the post again

Using the following data

df <- data.frame(A=c(1,1,2,2),B=c(NA,2,NA,4),C=c(3,NA,NA,5),D=c(NA,2,3,NA),E=c(5,NA,NA,4))

  A  B  C  D  E
  1 NA  3 NA  5
  1  2 NA  2 NA
  2 NA NA  3 NA
  2  4  5 NA  4

Grouping by A, I'd like the following output using a tidyverse solution

  A  B  C  D  E
  1  2  3  2  5
  2  4  5  3  4

I have many groups in A. I think I saw an answer using coalesce but am unsure how to get it work. I'd like a solution that works with characters as well. Thanks!

CPak
  • 13,260
  • 3
  • 30
  • 48
  • 1
    Possible duplicate of [Combine rows by group with differing NAs in each row](https://stackoverflow.com/questions/45201654/combine-rows-by-group-with-differing-nas-in-each-row) – BENY Aug 04 '17 at 21:01
  • `coalesce(df[1,], df[2,])`. – Rui Barradas Aug 04 '17 at 21:02
  • Thanks Wen: similar approach to `d.b.`'s answer, which should work, but I'm wondering if there's also a solution involving `coalesce` – CPak Aug 04 '17 at 21:02
  • Thanks Rui: please see my update. I have many groups in `A`. I'd like a `dplyr` chained solution for that reason – CPak Aug 04 '17 at 21:06
  • Does this answer your question? https://stackoverflow.com/questions/40515180/dplyr-how-to-find-the-first-non-missing-string-by-groups – Alex Aug 04 '17 at 21:17
  • Possible duplicate of [How can I remove all cells with "NA" value by columns](https://stackoverflow.com/questions/45376531/how-can-i-remove-all-cells-with-na-value-by-columns) – Uwe Aug 04 '17 at 22:18
  • @RuiBarradas `coalesce(df[1,], df[2,])` returns only one row but not two as requested. – Uwe Aug 04 '17 at 22:26
  • Some more solutions in this question https://stackoverflow.com/q/28036294/786542 – Tung Sep 10 '18 at 19:14
  • `coalesce` is a bad option here, since it really only takes the first non-missing value, and won't perform any of the grouping you want. In other words, `coalesce` will work if you have identified the rows that are matches for each other, excepting NAs, but otherwise is not a solution. – Dannid Dec 05 '18 at 22:45

7 Answers7

24

I haven't figured out how to put the coalesce_by_column function inside the dplyr pipeline, but this works:

coalesce_by_column <- function(df) {
  return(coalesce(df[1], df[2]))
}

df %>%
  group_by(A) %>%
  summarise_all(coalesce_by_column)

##       A     B     C     D     E
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     2     3     2     5
## 2     2     4     5     3     4

Edit: include @Jon Harmon's solution for more than 2 members of a group

# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>%
  group_by(A) %>%
  summarise_all(coalesce_by_column)

#> # A tibble: 2 x 5
#>       A     B     C     D     E
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     2     5
#> 2     2     4     5     3     4
Tung
  • 26,371
  • 7
  • 91
  • 115
Oriol Mirosa
  • 2,756
  • 1
  • 13
  • 15
  • Please see my updated example and I have edited my post to indicate that I have many groups in `A`, which is why I wanted a `dplyr` chained solution – CPak Aug 04 '17 at 21:08
  • 6
    From the dplyr::coalesce help: # Supply lists by splicing them into dots: vecs <- list( c(1, 2, NA, NA, 5), c(NA, NA, 3, 4, 5) ) coalesce(!!! vecs) So, based on that, I got coalesce_by_column to work for more than 2 members of a group: coalesce_by_column <- function(df) { return(dplyr::coalesce(!!! as.list(df))) } Hope that helps! – Jon Harmon Feb 12 '18 at 15:11
  • looks like this does not work on character columns. Just value. – user5249203 Oct 19 '20 at 17:56
16

We can use fill to fill all the missing values. And then filter just one row for each group.

library(dplyr)
library(tidyr)

df2 <- df %>%
  group_by(A) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)

And thanks to @Roger-123, the above code can be further simplified as follows.

df2 <- df %>%
  group_by(A) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1)
www
  • 38,575
  • 12
  • 48
  • 84
  • 1
    This is the best answer that is simple AND provides a tidyverse based answer, as the original question requested. – Roger-123 Sep 17 '21 at 19:10
  • 2
    This could also be simplified by changing the first fill line from `"down"` to `"downup"` and removing the second fill line. – Roger-123 Sep 20 '21 at 16:06
  • 1
    @Roger-123 Thanks for your comment. I just updated the post including your suggestion. – www Sep 21 '21 at 12:25
5

Not tidyverse but here's one base R solution

df <- data.frame(A=c(1,1),B=c(NA,2),C=c(3,NA),D=c(NA,2),E=c(5,NA))
sapply(df, function(x) x[!is.na(x)][1])
#A B C D E 
#1 2 3 2 5 

With updated data

do.call(rbind, lapply(split(df, df$A), function(a) sapply(a, function(x) x[!is.na(x)][1])))
#  A B C D E
#1 1 2 3 2 5
#2 2 4 5 3 4
d.b
  • 32,245
  • 6
  • 36
  • 77
  • Thanks, this could work in a `mutate` statement. I'll wait for a few other answers nonetheless – CPak Aug 04 '17 at 21:01
4

Here is an even more general solution (using unique, na.omit to sort of create coalesce), which can handle more than two rows with overlapping information. Super simply and forward.

> df <- data.frame(A=c(1,1,2,2,2),B=c(NA,2,NA,4,4),C=c(3,NA,NA,5,NA),D=c(NA,2,3,NA,NA),E=c(5,NA,NA,4,4))

> df
  A  B  C  D  E
1 1 NA  3 NA  5
2 1  2 NA  2 NA
3 2 NA NA  3 NA
4 2  4  5 NA  4
5 2  4 NA NA  4

> df %>% group_by(A) %>% summarise_all(funs( na.omit(unique(.)) ))
# A tibble: 2 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4
Jerry T
  • 1,541
  • 1
  • 19
  • 17
2

A simple way is to summarise by taking the max across all columns for each group with tidyverse:

library(tidyverse)

df %>% 
  group_by(A) %>% 
  summarise(across(everything(), ~ max(., na.rm = T)))

Output

      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4

Another option with using na.omit and unique:

df %>% 
  group_by(A) %>% 
  summarise(across(everything(), ~ na.omit(unique(.))))
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
1

A different tidyverse possibility could be:

df %>%
 gather(var, val, -A, na.rm = TRUE) %>%
 group_by(A, var) %>%
 distinct(val) %>%
 spread(var, val)

      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4

Here it, first, performs a wide-to-long data-transformation, excluding the "A" column and removing the missing values. Second, it groups by "A" column and the variable names. Third, it removes the duplicate values. Finally, it returns the data to its original wide format.

tmfmnk
  • 38,881
  • 4
  • 47
  • 67
0

This is functionally identical to @Oriol Mirosa's answer without requiring a custom function:

EDIT: NAs must be omitted as per @thelatemail's comment. This answer was also given by @MrFlick in the duplicate thread linked above.

df %>% group_by(A) %>% summarise_all(~first(na.omit(.)))

I wanted to add to this as it seems to come up regularly for me and I've revisited this thread many times. @Oriol Mirosa's answer works, however I'm resistant to it because it's just complex enough to be difficult to remember (hence my return to this thread).

Personally, I also don't like writing small custom functions like if I don't need to. Attempting to substitute coalesce_by_column with the actual coalesce call results in type errors (which I find strange as the rows aren't interacting with each other but whatever). This can be resolved by first doing mutate_all(as.character), however my goal here is to minimize syntax so it's easily remembered on the fly.

Furthermore, this substitution changes the behavior such that non-identical values within a column throws an error (why things sometimes behave slightly differently within a function is beyond me). This behavior may be preferred in some situations, however in that case I would recommend @Jerry T's solution as there is no custom function and the ones used are familiar, readable, and the ordering of them (na.omit and unique) isn't relevant.