0

I have a data frame that looks like:

entry color value1 value2
A     blue  1      0
A     green 1      0
B     green 1      0
C     red   0      1

I want to combine the rows that have the same value for all columns other than color for the output:

entry color        value1 value2
A     blue, green  1      0
B     green        1      0
C     red          0      1

I have attempted to use plyr and ddply as suggested in answer to this question https://www.biostars.org/p/167028/. This code produces the correct output for this minimal worked example. However, my actual dataset has 600 columns so this is not a viable solution. Does anyone have any suggestions that will work with larger datasets?

df2 <-ddply(df, .(entry), summarize,
         color=paste(unique(color),collapse=","), 
         value1=paste(unique(value1),collapse=",") ,
         value2= paste(unique(value2),collapse=","))

The example data frame is:

df <- data.frame("entry" = c("A", "A", "B", "C"),"color" = c("blue", "green", "green", "red"), "value1" = c(1,1,1,0), "value2" = c(0,0,0,1))

Edit: to solve the scalability problem I adapted the below answer and Group by multiple columns in dplyr, using string vector input

First create a vector of column names, minus the column of interest, using its index:

cnames <- names(df)[-2]

Then using the group_by_at function from dplyr:

df %>%
 group_by_at(vars(one_of(cnames))) %>%
 summarise(color=paste(unique(color), collapse=",")) %>% 
 ungroup()
RAS
  • 121
  • 3
  • 10

1 Answers1

2

You can try a tidyverse

library(tidyverse)
df %>% 
  group_by(entry, value1, value2) %>% 
  summarise(color=paste(unique(color), collapse=",")) %>% 
  ungroup()
# A tibble: 3 x 4
  entry value1 value2 color     
  <fct>  <dbl>  <dbl> <chr>     
1 A          1      0 blue,green
2 B          1      0 green     
3 C          0      1 red 

Remove the unique if you want to list duplicates as well like this summarise(color=toString(color))

Roman
  • 17,008
  • 3
  • 36
  • 49
  • 3
    FYI `paste(x, collapse = ', ')` is wrapped in `toString(x)` – Sotos Nov 13 '18 at 13:32
  • thanks for the hint...forgot this function :) – Roman Nov 13 '18 at 13:34
  • 1
    This does work for the small number of columns in the minimal example, but was not scalable for 600 columns. Instead selecting a vector of column names and `group_by_at` solved the scalability problem. – RAS Nov 13 '18 at 16:28