10

Consider the following simple example

group <-c('A','A','A','B','B','B','B')
names<- c(NA,'fred',NA,'josh','josh',NA,NA)
data=data_frame(group,names)

> data
# A tibble: 7 × 2
  group names
  <chr> <chr>
1     A  <NA>
2     A  fred
3     A  <NA>
4     B  josh
5     B  josh
6     B  <NA>
7     B  <NA>

Here, I would like to get, for each group the first non missing name in names. How can I do that? The solution below using coalesce and first fail.

data %>% group_by(group) %>% mutate(first_non_missing = first(names),
                                    first_non_missing_alt = coalesce(names)) %>% ungroup()

# A tibble: 7 × 4
  group names first_non_missing first_non_missing_alt
  <chr> <chr>             <chr>                 <chr>
1     A  <NA>              <NA>                  <NA>
2     A  fred              <NA>                  fred
3     A  <NA>              <NA>                  <NA>
4     B  josh              josh                  josh
5     B  josh              josh                  josh
6     B  <NA>              josh                  <NA>
7     B  <NA>              josh                  <NA>

Indeed, for group A, first_non_missing should be fred for all three observations..

Many thanks!

ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235

1 Answers1

17

Summarise will give one entry per group, here, finding the first non-missing using which

data %>%
  group_by(group) %>%
  summarise(first_non_missing = names[which(!is.na(names))[1]])

gives

  group first_non_missing
  <chr>             <chr>
1     A              fred
2     B              josh

If you still want all of the rows, replace summarise with mutate.

Mark Peterson
  • 9,370
  • 2
  • 25
  • 48
  • yes that makes sense. Thanks! Besides, I dont know what `coalesce` is supposed to do given that it does not work here... – ℕʘʘḆḽḘ Nov 09 '16 at 20:17
  • 1
    Glad it works. I believe that `coalesce` is designed to be used *across* columns, not within them. That is, if you had three `name` columns, it would return the first one with a non-missing value for each row. – Mark Peterson Nov 09 '16 at 21:09
  • 2
    It's utility is actually larger than you might suspect. Imagine a case where you had a form that allowed respondents to include their "preferred" name in addition to their first name. If you are addressing them, you want to use the preferred name, but only if it is included. So, you might create a column `data %>% mutate(toAddress = coalesce(preferred, first) )` – Mark Peterson Nov 10 '16 at 12:28