2

I'm having a problem where I normally collapse string data in a column for multiple rows into one column, but for some reason the code isn't doing what I expect.

My data looks like:

       Genes Source      Type
   1:   LZIC Source1 Secondary
   2:   LZIC Source2      Lead
   3:  KIF1B Source1 Secondary
   4:  CASZ1 Source1 Secondary
   5:  CASZ1 Source4 Secondary

I want to compress by Genes and I do this with code from similar questions on this site e.g.:

source  <- df %>%
  group_by(Genes) %>%
  summarize(text = str_c(Source, collapse = ", "))

type <- df %>%
  group_by(Genes) %>%
  summarize(text = str_c(Type, collapse = ", "))

However, the output for these doesn't look like what I expect it to, for each variable I create I'm getting one row with all the sources or types as a string and nothing else.

The output I'm trying to get is:

        Genes   Source               Type
   1:   LZIC    Source1, Source1     Secondary, Lead
   2:  KIF1B    Source1              Secondary
   3:  CASZ1    Source1, Source4     Secondary, Secondary

Is there something wrong with my code? It's worked for me in other cases. I've tried also modifying the code to do the 2 columns compression at once and failed with that separately.

Input data:

structure(list(Genes = c("LZIC", "CDC14A", "KIF1B", "CASZ1", 
"CASZ1"), Source = c("BPICE_UKBfinemapCommon(1210)", "GxL_Fuentes_Educ_T2nov_TRANS", 
"BPICE_UKBfinemapCommon(1210)", "BPICE_UKBfinemapCommon(1210)", 
"BPICE_UKBfinemapCommon(1210)"), Type = c("Secondary", "Lead", 
"Secondary", "Secondary", "Secondary")), row.names = c(NA, -5L
), class = c("data.table", "data.frame"))
DN1
  • 234
  • 1
  • 13
  • 38

1 Answers1

4

Try this. You can use dplyr grouping by Genes and then use summarise_all() and toString() functions in order to have the expected outcome. Here the code where I have used the data you shared as df:

library(dplyr)
#Code
newdf <- df %>% group_by(Genes) %>% summarise_all(toString)

Output:

# A tibble: 3 x 3
  Genes Source           Type                
  <chr> <chr>            <chr>               
1 CASZ1 Source1, Source4 Secondary, Secondary
2 KIF1B Source1          Secondary           
3 LZIC  Source1, Source2 Secondary, Lead     

Or using base R:

#Code2
newdf <- aggregate(cbind(Source,Type)~Genes,df,toString)

Output:

  Genes           Source                 Type
1 CASZ1 Source1, Source4 Secondary, Secondary
2 KIF1B          Source1            Secondary
3  LZIC Source1, Source2      Secondary, Lead
Duck
  • 39,058
  • 13
  • 42
  • 84