Dplyr : how to find the first-non missing string by groups?

Question

Consider the following simple example

group <-c('A','A','A','B','B','B','B')
names<- c(NA,'fred',NA,'josh','josh',NA,NA)
data=data_frame(group,names)

> data
# A tibble: 7 × 2
  group names
  <chr> <chr>
1     A  <NA>
2     A  fred
3     A  <NA>
4     B  josh
5     B  josh
6     B  <NA>
7     B  <NA>

Here, I would like to get, for each group the first non missing name in names. How can I do that? The solution below using coalesce and first fail.

data %>% group_by(group) %>% mutate(first_non_missing = first(names),
                                    first_non_missing_alt = coalesce(names)) %>% ungroup()

# A tibble: 7 × 4
  group names first_non_missing first_non_missing_alt
  <chr> <chr>             <chr>                 <chr>
1     A  <NA>              <NA>                  <NA>
2     A  fred              <NA>                  fred
3     A  <NA>              <NA>                  <NA>
4     B  josh              josh                  josh
5     B  josh              josh                  josh
6     B  <NA>              josh                  <NA>
7     B  <NA>              josh                  <NA>

Indeed, for group A, first_non_missing should be fred for all three observations..

Many thanks!

What is the expected result? Should the found name populate all values in a new variable in a given group? — Roman Luštrik, Nov 09 '16 at 20:14
yes its just a regular mutate. same size of original dataframe, and a new column that just repeats the first non missing string in `names` for each group — ℕʘʘḆḽḘ, Nov 09 '16 at 20:16
Maybe use `na.omit`? `data %>% group_by(group) %>% summarise(first_non_missing = na.omit(names)[1])` — Jaap, Nov 09 '16 at 20:37

score 17 · Accepted Answer · answered Nov 09 '16 at 20:16

17

Summarise will give one entry per group, here, finding the first non-missing using which

data %>%
  group_by(group) %>%
  summarise(first_non_missing = names[which(!is.na(names))[1]])

gives

  group first_non_missing
  <chr>             <chr>
1     A              fred
2     B              josh

If you still want all of the rows, replace summarise with mutate.

answered Nov 09 '16 at 20:16

Mark Peterson

9,370
2
25
48

yes that makes sense. Thanks! Besides, I dont know what `coalesce` is supposed to do given that it does not work here... – ℕʘʘḆḽḘ Nov 09 '16 at 20:17
1

Glad it works. I believe that `coalesce` is designed to be used *across* columns, not within them. That is, if you had three `name` columns, it would return the first one with a non-missing value for each row. – Mark Peterson Nov 09 '16 at 21:09
2

It's utility is actually larger than you might suspect. Imagine a case where you had a form that allowed respondents to include their "preferred" name in addition to their first name. If you are addressing them, you want to use the preferred name, but only if it is included. So, you might create a column `data %>% mutate(toAddress = coalesce(preferred, first) )` – Mark Peterson Nov 10 '16 at 12:28

Dplyr : how to find the first-non missing string by groups?

1 Answers1

Linked