I am trying to collapse string data from multiple columns into one summary column, however I wish to ignore NA's and duplicates. The data is varies, and duplicates and number of NAs vary across the data set and columns. I have pasted a reproducable data example below:
# example data
data <- data.frame(id = c(1:4), A = c("one", "two", "threeA"), B = c(NA, "two", "threeB"), C = c(NA, NA, NA))
data
id A B C
1 1 <NA> <NA> NA
2 2 one <NA> NA
3 3 two two NA
4 4 threeA threeB NA
# desired result
id D
1 1 <NA>
2 2 one
3 3 two
4 4 threeA, threeB
I am really having trouble achieving both of these things (removing dupes and NA's) at the same time.
I'm using dplyr
mostly, though am open to other suggestions. I've tried just using paste
and mutate
:
data %>%
transmute(id, D = paste( A, B, C , sep = ", ") )
However that pastes 'NA'
text for NA
values and pastes duplicates - and the answer here seems overly complicated and expensive (I have tens of thousands of rows).
I have tried using a combination of group_by()
and summarise()
in dplyr
as well as unique()
(having to unlist the data first)) and paste()
as follows:
data %>%
group_by(id) %>%
summarize(D = paste( unique( unlist( A, B, C )) , sep = ", ")
)
This also pastes NAs, but I can easily remove this later, however it only pastes the FIRST unique value found, not all unique values.
Any suggestions on how to achieve this?