Say I have the following data.frame df
of patient hometowns and one arbitrary clinical metric, heart rate:
id <- c(rep(1:3, each = 2), rep(4, 3))
pt_hometown <- c("Atlanta", NA,
NA, "San Diego",
NA, NA,
"San Francisco", "Seattle", NA)
pt_heartrate <- c(NA, 82,
NA, NA,
76, 76,
90, 93, NA)
df <- data.frame(id = id,
pt_hometown = pt_hometown,
pt_heartrate = pt_heartrate,
stringsAsFactors = FALSE)
df
Which gives
id pt_hometown pt_heartrate
1 Atlanta NA
1 <NA> 82
2 <NA> NA
2 San Diego NA
3 <NA> 76
3 <NA> 76
4 San Francisco 90
4 Seattle 93
4 <NA> NA
As I've learned here, summarise_each
can apply one or more functions to a grouped dataframe to collapse records to one per group. The simplest case might be selecting the first non-NA value from all variables within df
and collapsing them down to one per group.
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(first(.[!is.na(.)]))
df1
id pt_hometown pt_heartrate
1 Atlanta 82
2 San Diego NA
3 NA 76
4 San Francisco 90
Of course, for practical applications, one might want to collapse with a bit more specificity. I know how to group df
's variables by type and, for instance, select the max
heart rate per id
and collapse to one record, but what I do not know how to do is conditionally collapse character variables to one record per group, given there is only one unique non-NA value.
More concretely, consider the patient with id
number 4. They have two unique values for pt_hometown
, "San Francisco" and "Seattle". Obviously both cannot be correct. So I would like to collapse records for each group where there is only one non-NA value, but retain rows where multiple non-NA elements exist and then bring it to the attention of our group to decide how to correct the mistake in the original dataset.
So I'd like df1
to look like this:
id pt_hometown pt_heartrate
1 Atlanta 82
2 San Diego NA
3 <NA> 76
4 San Francisco 90
4 Seattle 93
This is what I've tried:
df1 <- df %>%
group_by(id) %>%
summarise_each_(funs(first(.[!is.na(.)])), df[length(unique(.[!is.na(.)])) == 1])