0

I am trying to collapse string data from multiple columns into one summary column, however I wish to ignore NA's and duplicates. The data is varies, and duplicates and number of NAs vary across the data set and columns. I have pasted a reproducable data example below:

# example data 
data <- data.frame(id = c(1:4), A = c("one", "two", "threeA"),  B = c(NA, "two", "threeB"), C = c(NA, NA, NA))
data

  id      A      B  C
1  1   <NA>   <NA> NA
2  2    one   <NA> NA
3  3    two    two NA
4  4 threeA threeB NA

# desired result
  id              D
1  1           <NA>
2  2            one
3  3            two
4  4 threeA, threeB

I am really having trouble achieving both of these things (removing dupes and NA's) at the same time.

I'm using dplyr mostly, though am open to other suggestions. I've tried just using paste and mutate:

data %>%
  transmute(id, D = paste( A, B, C , sep = ", ") )

However that pastes 'NA' text for NA values and pastes duplicates - and the answer here seems overly complicated and expensive (I have tens of thousands of rows).

I have tried using a combination of group_by() and summarise() in dplyr as well as unique() (having to unlist the data first)) and paste() as follows:

data %>%
  group_by(id) %>%
  summarize(D = paste( unique( unlist(  A, B, C )) , sep = ", ") 
  )

This also pastes NAs, but I can easily remove this later, however it only pastes the FIRST unique value found, not all unique values.

Any suggestions on how to achieve this?

lmo
  • 37,904
  • 9
  • 56
  • 69
renegademonkey
  • 457
  • 1
  • 7
  • 18
  • na.omit does not work in this case (I do not want to apply this to ALL columns in the df, just a select subset) - plus I also need to remove duplicates at the same time. – renegademonkey Mar 20 '17 at 13:02
  • 1
    Could try `data %>% gather(variable, value, -id) %>% group_by(id) %>% summarise(toString(unique(value[!is.na(value)])))` or such – David Arenburg Mar 20 '17 at 13:04
  • Hi @DavidArenburg. what is gather and what does it do? it's not a dplyr or base function. – renegademonkey Mar 20 '17 at 13:08
  • It's from `tidyr`- yet another Hadleyverse package. It converts a wide data to a long format – David Arenburg Mar 20 '17 at 13:09
  • 2
    @renegademonkey works for me with your given data `apply(data[-1], 1, function(x) toString(unique(na.omit(x))))` – Ronak Shah Mar 20 '17 at 13:14
  • @RonakShah that's it. I had tried something similar before but must have gotten the wrong syntax. I like that it doesn't require additional packages. – renegademonkey Mar 20 '17 at 13:40

0 Answers0