I have a data frame that looks like:
entry color value1 value2
A blue 1 0
A green 1 0
B green 1 0
C red 0 1
I want to combine the rows that have the same value for all columns other than color for the output:
entry color value1 value2
A blue, green 1 0
B green 1 0
C red 0 1
I have attempted to use plyr
and ddply
as suggested in answer to this question https://www.biostars.org/p/167028/. This code produces the correct output for this minimal worked example. However, my actual dataset has 600 columns so this is not a viable solution. Does anyone have any suggestions that will work with larger datasets?
df2 <-ddply(df, .(entry), summarize,
color=paste(unique(color),collapse=","),
value1=paste(unique(value1),collapse=",") ,
value2= paste(unique(value2),collapse=","))
The example data frame is:
df <- data.frame("entry" = c("A", "A", "B", "C"),"color" = c("blue", "green", "green", "red"), "value1" = c(1,1,1,0), "value2" = c(0,0,0,1))
Edit: to solve the scalability problem I adapted the below answer and Group by multiple columns in dplyr, using string vector input
First create a vector of column names, minus the column of interest, using its index:
cnames <- names(df)[-2]
Then using the group_by_at
function from dplyr
:
df %>%
group_by_at(vars(one_of(cnames))) %>%
summarise(color=paste(unique(color), collapse=",")) %>%
ungroup()