11

Say I have the following dataframe df

name <- c("Bill", "Rob", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave")
favteam <- c("Dodgers", "Mets", "Pirates", NA)

df <- data.frame(name = name, 
                 address = address,
                 favteam = favteam)
df

Which looks like:

  name       address favteam
1 Bill   123 Main St Dodgers
2  Rob  234 Broad St    Mets
3  Joe          <NA> Pirates
4  Joe 456 North Ave    <NA>

What I want to do is collapse (coalesce) rows by name (or in general, any number of grouping variables) and have any other value than NA replace the NA value in the final data, like so:

df_collapse <- foo(df)

  name   address        favteam
1 Bill   123 Main St    Dodgers
2  Rob   234 Broad St      Mets
3  Joe   456 North Ave  Pirates
Henrik
  • 65,555
  • 14
  • 143
  • 159
mcjudd
  • 1,520
  • 2
  • 18
  • 33
  • 5
    Can Joe change his mind about his team, or correct his address on the second or subsequent records? – vpipkt Feb 13 '15 at 22:06
  • Joe lives off the grid and actively avoids data collectors. We were only able to track him down twice and he was vehemently opposed to telling us anything about his life, but fortunately he loves talking baseball so... – mcjudd Feb 13 '15 at 22:14
  • Jokes aside, varying non-NA responses add another wrinkle to my problem, but I figured I would take it one step at a time. Any thoughts on implementing a hierarchy based on order of levels in a factor variable? That sort of question doesn't seem to have been answered on SO... – mcjudd Feb 13 '15 at 22:17
  • What do you mean by "implementing a hierarchy based on order of levels in a factor variable"? Can you explain a bit more detailed? What do you expect in cases of varying non-NA responses per name? – talat Feb 13 '15 at 22:25

1 Answers1

16

Here's an option with dplyr:

library(dplyr)

df %>%
  group_by(name) %>%
  summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))

#Source: local data frame [3 x 3]
#
#  name       address favteam
#1 Bill   123 Main St Dodgers
#2  Joe 456 North Ave Pirates
#3  Rob  234 Broad St    Mets

And with data.table:

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
#   name       address favteam
#1: Bill   123 Main St Dodgers
#2:  Rob  234 Broad St    Mets
#3:  Joe 456 North Ave Pirates

Or

setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]

Edit:

You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.

Consider this modified sample data (look at last row):

name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)

df <- data.frame(name = name, 
                 address = address,
                 favteam = favteam)

df
#  name       address favteam
#1 Bill   123 Main St Dodgers
#2  Rob  234 Broad St    Mets
#3  Joe          <NA> Pirates
#4  Joe 456 North Ave    <NA>
#5  Joe 123 Boulevard    <NA>

Then, you can use this data.table approach to get the non-NA responses that can be varying in number by name:

setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
#   name       address favteam
#1: Bill   123 Main St Dodgers
#2:  Rob  234 Broad St    Mets
#3:  Joe 456 North Ave Pirates
#4:  Joe 123 Boulevard Pirates
talat
  • 68,970
  • 21
  • 126
  • 157
  • Great answer -- thank you for your attention to detail, @docendo. I'm particularly drawn to the `dplyr` solution since it's been on my list to become more familiar with that package's syntax. For this part: `summarise_each(funs(first(.[!is.na(.)])))`, is the period a short-hand way of referring to `df` grouped by `name`? I was unaware `dplyr` played nice with indices. Also, would be very appreciative if you could refer me to a thorough tutorial on the finer points of `dplyr`. – mcjudd Feb 14 '15 at 02:47
  • @mcjudd, glad it worked :) The `.` in `summarise_each` refers to the current data which is a) grouped and b) column-wise. So `first(.[!is.na(.)])` means: in each column we summarise and each group of `name` in that column, take the first data point that is not `NA` and return it as the summarised value in that column for that group. Unfortunately I can't really tell you much about dplyr tutorials. You'll find many if you just google it, for example [this one by Hadley](http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/). – talat Feb 14 '15 at 11:03
  • @mcjudd, I'm not sure I understand exactly how you mean it but you could try to extend the dplyr pipe with `... %>% mutate_each(funs(replace(., which(. == 0), 1))`. You could also use ifelse but replace is faster. Technically you could also do that inside the summarize each but it would make it less readable and more importantly you don't need to do this by group (and after the summarize the data is no longer grouped so it will be better with mutate each afterwards). – talat Feb 17 '15 at 20:50
  • Sorry, silly question. Figured it out. `summarise_each(funs(first(.[!is.na(.)]), max))` :) – mcjudd Feb 17 '15 at 20:58
  • Interesting though -- once you apply more than one function, the columns that you applied the functions to take on the name of the function as a suffix... e.g., `var1_max`, `var2_first`. Can this behavior be disabled? – mcjudd Feb 17 '15 at 21:12
  • What would you want instead? This is intended behavior in order to avoid duplicated column names which makes a lot of sense usually. – talat Feb 17 '15 at 21:14
  • 1
    @mcjudd, Also see https://stackoverflow.com/questions/27027347/mutate-each-in-dplyr-how-do-i-select-certain-columns-and-give-new-names-to-muta for a related question – talat Feb 17 '15 at 21:21