I am working with gene ontology (GO) data and have this data frame:

> head(BT_Ctrl_go_term, 13)
# A tibble: 13 x 4
   go_term        n gene     go_name                 
   <chr>      <int> <chr>    <chr>                   
 1 GO:0001525    15 NRP1     angiogenesis            
 2 GO:0001525    15 ANG      angiogenesis            
 3 GO:0001525    15 THY1     angiogenesis            
 4 GO:0001525    15 ATP5F1B  angiogenesis            
 5 GO:0001525    15 ECM1     angiogenesis            
 6 GO:0001666     6 ANG      response to hypoxia     
 7 GO:0001666     6 CAT      response to hypoxia     
 8 GO:0001666     6 HSP90B1  response to hypoxia     
 9 GO:0002250     8 IGKV1-27 adaptive immune response
10 GO:0002250     8 IGHV3-21 adaptive immune response
11 GO:0002250     8 TNFRSF21 adaptive immune response
12 GO:0002250     8 IGLV2-11 adaptive immune response
13 GO:0002250     8 IGHV4-34 adaptive immune response

I need to rearrange the data so that each go_name appears on exactly one row. I then need a new column, genes, that lists every BT_Ctrl_go_term$gene belonging to the corresponding BT_Ctrl_go_term$go_name, with the gene names separated by a comma.

Expected output:

     go_term  n                  go_name                                            genes
1 GO:0001525 15             angiogenesis                   NRP1, ANG, THY1, ATP5F1B, ECM1
2 GO:0001666  6      response to hypoxia                                ANG, CAT, HSP90B1
3 GO:0002250  8 adaptive immune response IGKV1-27, IGHV3-21, TNFRSF21, IGLV2-11, IGHV4-34

A dplyr solution is preferable.

Data

BT_Ctrl_go_term <- structure(list(go_term = c("GO:0001525", "GO:0001525", "GO:0001525", 
"GO:0001525", "GO:0001525", "GO:0001666", "GO:0001666", "GO:0001666", 
"GO:0002250", "GO:0002250", "GO:0002250", "GO:0002250", "GO:0002250"
), n = c(15L, 15L, 15L, 15L, 15L, 6L, 6L, 6L, 8L, 8L, 8L, 8L, 
8L), gene = c("NRP1", "ANG", "THY1", "ATP5F1B", "ECM1", "ANG", 
"CAT", "HSP90B1", "IGKV1-27", "IGHV3-21", "TNFRSF21", "IGLV2-11", 
"IGHV4-34"), go_name = c("angiogenesis", "angiogenesis", "angiogenesis", 
"angiogenesis", "angiogenesis", "response to hypoxia", "response to hypoxia", 
"response to hypoxia", "adaptive immune response", "adaptive immune response", 
"adaptive immune response", "adaptive immune response", "adaptive immune response"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
cmirian

1 Answer

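We can group by go_term, n and go_name, then paste the genes of each group into a single comma-separated string with toString():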

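library(dplyr)
BT_Ctrl_go_term %>% 
    # one group per GO term; n and go_name are constant within a term
    group_by(go_term, n, go_name) %>% 
    # collapse the distinct genes of each group into "a, b, c"
    summarise(gene = toString(unique(gene)), .groups = 'drop')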

-output

# A tibble: 3 x 4
  go_term        n go_name                  gene                                            
  <chr>      <int> <chr>                    <chr>                                           
1 GO:0001525    15 angiogenesis             NRP1, ANG, THY1, ATP5F1B, ECM1                  
2 GO:0001666     6 response to hypoxia      ANG, CAT, HSP90B1                               
3 GO:0002250     8 adaptive immune response IGKV1-27, IGHV3-21, TNFRSF21, IGLV2-11, IGHV4-34
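If the new column should be called genes, as in the expected output, the same grouping can be sketched as follows. This is a minimal illustration on a small stand-in data frame (go_df and result are hypothetical names, and the rows are a subset of the question's data); paste(gene, collapse = ", ") builds the same string that toString(gene) would.

```r
library(dplyr)

# Hypothetical stand-in for BT_Ctrl_go_term: a few rows from the question
go_df <- data.frame(
  go_term = c("GO:0001525", "GO:0001525", "GO:0001666", "GO:0001666"),
  n       = c(15L, 15L, 6L, 6L),
  gene    = c("NRP1", "ANG", "ANG", "CAT"),
  go_name = c("angiogenesis", "angiogenesis",
              "response to hypoxia", "response to hypoxia")
)

result <- go_df %>%
  # one group per GO term; n and go_name ride along as grouping columns
  group_by(go_term, n, go_name) %>%
  # name the collapsed column genes to match the expected output
  summarise(genes = paste(gene, collapse = ", "), .groups = "drop")

print(result)
```

Unlike the version with unique(), this sketch keeps a gene that is listed more than once under the same GO term; wrap gene in unique() if duplicates should be dropped.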
akrun