You do this with tidyr
and dplyr
using essentially two steps:
spread
( tidyr
) to create new columns with the names in B
and values in C
group_by
your ID
column and then summarize_each
of the remaining columns (dplyr
) with a custom summary function to deal with either character or numeric columns
Here's the full solution, which requires a few more commands for bookkeeping.
1. spread. One gotcha with spread is that all rows have to have a unique id in some column(see here)
library(dplyr)
library(tidyr)
## spread with 'convert' will convert types automatically
spread_data <- dat %>%
mutate(row= 1:nrow(dat)) %>% ## uniquely id rows
spread(B, C, convert = TRUE) %>%
select(-row)
## converting the data.frame to a tbl_df
## lets us easily see the cols are different types
tbl_df(spread_data)
##Source: local data frame [10 x 5]
##
## ID Blue Green Red Yellow
## (int) (int) (chr) (int) (chr)
## 1 1539 8 NA NA NA
## 2 1539 4 NA NA NA
## 3 1539 NA NA 9 NA
## 4 1539 NA NA 13 NA
## 5 1539 NA NA NA NCAA
## 6 3574 NA RA NA NA
## 7 3574 NA RA NA NA
## 8 3574 NA RA NA NA
## 9 3574 NA NA 1 NA
## 10 3574 NA NA 1 NA
2. group and summarize. first, we need to write a function that will handle either type of columns
summarizer <- function(x) {
if (is.numeric(x)) {
sum(x, na.rm = TRUE)
} else {
# assume x is a character
if (all(is.na(x)))
return("-")
x[is.na(x)] <- ""
x <- unique(x)
paste0(x, collapse="")
}
}
## summarize each applies summarizer to the columns that aren't being used
## for grouping (so, not ID in this case)
spread_data %>% group_by(ID) %>%
summarize_each(funs(summarizer))
##Source: local data frame [2 x 5]
##
## ID Blue Green Red Yellow
## (int) (int) (chr) (int) (chr)
## 1 1539 12 - 22 NCAA
## 2 3574 0 RA 2 -