3

Say I have the following data.frame df of patient hometowns and one arbitrary clinical metric, heart rate:

id          <- c(rep(1:3, each = 2), rep(4, 3))
pt_hometown <- c("Atlanta", NA, 
                 NA, "San Diego", 
                 NA, NA, 
                 "San Francisco", "Seattle", NA)
pt_heartrate <- c(NA, 82,
                  NA, NA,
                  76, 76,
                  90, 93, NA)

df <- data.frame(id = id, 
                 pt_hometown = pt_hometown,
                 pt_heartrate = pt_heartrate,
                 stringsAsFactors = FALSE)
df

Which gives

id   pt_hometown pt_heartrate
 1       Atlanta           NA
 1          <NA>           82
 2          <NA>           NA
 2     San Diego           NA
 3          <NA>           76
 3          <NA>           76
 4 San Francisco           90
 4       Seattle           93
 4          <NA>           NA

As I've learned here, summarise_each can apply one or more functions to a grouped dataframe to collapse records to one per group. The simplest case might be selecting the first non-NA value from all variables within df and collapsing them down to one per group.

  df1 <- df %>%  
    group_by(id) %>%
    summarise_each(funs(first(.[!is.na(.)]))

df1

id   pt_hometown pt_heartrate
 1       Atlanta           82
 2     San Diego           NA
 3            NA           76
 4 San Francisco           90

Of course, for practical applications, one might want to collapse with a bit more specificity. I know how to group df's variables by type and, for instance, select the max heart rate per id and collapse to one record, but what I do not know how to do is conditionally collapse character variables to one record per group, given there is only one unique non-NA value.

More concretely, consider the patient with id number 4. They have two unique values for pt_hometown, "San Francisco" and "Seattle". Obviously both cannot be correct. So I would like to collapse records for each group where there is only one non-NA value, but retain rows where multiple non-NA elements exist and then bring it to the attention of our group to decide how to correct the mistake in the original dataset.

So I'd like df1 to look like this:

id   pt_hometown pt_heartrate
 1       Atlanta           82
 2     San Diego           NA
 3          <NA>           76
 4 San Francisco           90
 4       Seattle           93

This is what I've tried:

df1 <- df %>%  
  group_by(id) %>%
  summarise_each_(funs(first(.[!is.na(.)])), df[length(unique(.[!is.na(.)])) == 1])
Community
  • 1
  • 1
mcjudd
  • 1,520
  • 2
  • 18
  • 33
  • 1
    To my knowledge, `summarise` and `summarise_each` currently only allow 1 row per group returned. The developers [are discussing](https://github.com/hadley/dplyr/issues/154) how to make it more flexible. – talat Feb 23 '15 at 19:29

2 Answers2

4

I'm a little unclear on what you want for some edge cases, but this works for the OP:

library(data.table)
dt = as.data.table(df) # or convert in place using setDT

unique(dt, by = c('id', 'pt_hometown'))[, lapply(.SD, na.omit), by = id]
#   id   pt_hometown pt_heartrate
#1:  1       Atlanta           82
#2:  2     San Diego           NA
#3:  3            NA           76
#4:  4 San Francisco           90
#5:  4       Seattle           93
eddi
  • 49,088
  • 6
  • 104
  • 155
3

As commented above, there is currently no way to use dplyr::summarise_each with variable number of rows to be returned.

If you want to go on using dplyr, you could circumvent this by using mutate_each and distinct.

Here's an example:

f <- function(.) if(length(unique(.[!is.na(.)])) > 1L) . else first(.[!is.na(.)]) 

df %>% 
  group_by(id) %>%
  mutate_each(funs(f)) %>%
  ungroup() %>%
  distinct() %>% 
  filter(rowSums(is.na(.)) < 2L)     # assuming you don't have NAs in the ID column

#Source: local data frame [5 x 3]
#
#  id   pt_hometown pt_heartrate
#1  1       Atlanta           82
#2  2     San Diego           NA
#3  3            NA           76
#4  4 San Francisco           90
#5  4       Seattle           93

However, the data.table approach in my answer to your previous question or that by eddi would probably be more efficient.

talat
  • 68,970
  • 21
  • 126
  • 157
  • Beautiful. Thanks again for sticking with these questions. It's on my list to branch out to `data.table` one of these days. Now if I could only find the time... :) – mcjudd Feb 24 '15 at 02:53
  • Another question -- say that instead of only 1 character variable in this data.frame `df` (`pt_hometown`), I had dozens, and I only wanted to select those variables instead of the whole `df`. I know I would use a `select` function at the end of the `%>%` chain. However, how can I pass a character vector containing the names of the character columns in `df` (named `dfcharvars`, for example) to `select`? I tried `select_(dfcharvars)` but it only selects the first element of that vector. Thanks. – mcjudd Feb 26 '15 at 17:06
  • Figured out a work around, but would still like to know how to use `select`... what I did was just use `%>% .[dfcharvars]` at the end. Which seems too simple... – mcjudd Feb 26 '15 at 17:16