spiting data in data.frame divided by a comma onto a new row

Question

I have a data.frame with three columns, the first column(Diagnosis) is a diagnosis HCM or DCM. The Second column(Result) is the genetic results of either: Pathogenic(P), Likely pathogenic(LP), Variant of unknown significance(VUS). In the third column(Gene.Name) is he name of the gene involved i.e. MYH7, RAF1, TTN, etc....

The problem I have is when a row has more than one gene involved e.g on row 69 under the Result column it has 'LP,VUS' and under the Gene.Name column it has 'RAF1(LP),TTN(VUS)'.

How can I get R to make a new row under row 69 and what come before the ',' in both columns('LP' and 'RAF1') to stay on one row and move what comes after the ','('VUS' and 'TTN) onto the new row? while keeping the diagnosis column constant

I want to able to do this for all 68 rows baring in mind that there are some columns where the result in the Result column applies to more than one gene in the Gene.Name column i.e. row 6 which has in the Results column 'VUS', and in the Gene.Name column 'DSC2, LAMAS, TTN' all of which are VUS.

The end goal is to make a histogram showing the frequency of each gene, divided into HCM and DCM and also by pathogenicity

I'm not sure if R is capable of this, but I don't know, so I thought I would ask.

Thanks in advance

here is the dput of my data(it contains 68 observations, but it is subset from another dataframe, so therefore what is row number 69 in my example may be row 23 when you put it through R:

structure(list(Diagnosis = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 
3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L), .Label = c("DCM", 
"DCM ", "HCM"), class = "factor"), Result = structure(c(10L, 
10L, 10L, 10L, 2L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 2L, 
10L, 10L, 10L, 6L, 2L, 2L, 6L, 10L, 9L, 4L, 6L, 10L, 2L, 6L, 
6L, 6L, 10L, 10L, 9L, 6L, 10L, 2L, 7L, 3L, 10L, 2L, 6L, 2L, 2L, 
10L, 10L, 2L, 2L, 2L, 6L, 2L, 10L, 10L, 2L, 10L, 2L, 10L, 10L, 
2L, 2L, 10L, 8L, 10L, 9L, 8L, 10L, 6L, 6L, 6L), .Label = c("", 
"LP", "LP, VUS", "LP,VUS", "NIL", "P", "P, LB", "P, VUS", "P,VUS", 
"VUS"), class = "factor"), Gene.Name = structure(c(41L, 10L, 
41L, 12L, 39L, 41L, 38L, 4L, 42L, 27L, 17L, 37L, 24L, 18L, 9L, 
3L, 26L, 14L, 25L, 34L, 15L, 13L, 7L, 35L, 32L, 30L, 19L, 32L, 
21L, 29L, 31L, 6L, 20L, 19L, 29L, 19L, 33L, 22L, 41L, 21L, 29L, 
19L, 19L, 19L, 16L, 16L, 16L, 40L, 1L, 21L, 2L, 8L, 41L, 41L, 
39L, 41L, 36L, 38L, 21L, 11L, 23L, 39L, 5L, 23L, 28L, 19L, 21L, 
21L), .Label = c("", "ACTA1", "ACTA1,TTN", "ACTN2,LAMA2,LDB3,NPPA,TTN", 
"ALMS1/TXNRD2", "ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3", 
"DES(P),DSC2(VUS),TTN(VUS)", "DES, TTN", "DMD,MYH6,PKP2,TTN", 
"DSC2,LAMAS,TTN", "DSP", "EMD,TTN", "HRAS", "KRAS", "LMAP2", 
"LMNA", "LMNA,ANKRD1,TTN,ACTN2", "MAP3K7", "MYBPC3", "MYEBC3(P),PRKAG2(VUS)", 
"MYH7", "MYH7(LP),LAMP2(VUS), MYBPC3(VUS),TTN(VUS)", "MYH7(P),VCL(VUS)", 
"MYH7, BAG3, NEXN", "MYH7, TTN", "MYL2", "PKP2, TTN", "PKP2,TTN", 
"PRKAG2", "PRKAG2, TTN", "PRKAG2,MYH7", "PTPN11", "PTPN11(P), MYL3(LB)", 
"RAF1", "RAF1(LP),TTN(VUS)", "RBM20", "TBX20,TTN", "TNNI3", "TNNT2", 
"TPM1", "TTN", "TTN, DSG2"), class = "factor")), .Names = c("Diagnosis", 
"Result", "Gene.Name"), row.names = c(1L, 6L, 7L, 8L, 11L, 12L, 
14L, 15L, 17L, 18L, 22L, 25L, 29L, 30L, 32L, 33L, 47L, 57L, 59L, 
63L, 64L, 67L, 68L, 69L, 70L, 75L, 79L, 80L, 81L, 82L, 84L, 86L, 
88L, 89L, 92L, 93L, 95L, 98L, 100L, 101L, 106L, 107L, 109L, 110L, 
111L, 112L, 113L, 114L, 115L, 116L, 119L, 127L, 130L, 132L, 133L, 
134L, 137L, 138L, 139L, 140L, 141L, 142L, 145L, 148L, 150L, 151L, 
152L, 153L), class = "data.frame")

For the first step "delimited text to new row", see https://stackoverflow.com/a/31514711/680068 — zx8754, Apr 10 '18 at 08:57

Roman · Accepted Answer · 2018-04-10T09:31:42.580

You can try a tidyverse solution:

library(tidyverse)
d %>% 
  as.tibble(rownames = NULL) %>% 
  slice(c(1,8,23,32)) 
# A tibble: 4 x 3
  Diagnosis Result Gene.Name                                        
  <fct>     <fct>  <fct>                                            
1 DCM       VUS    TTN                                              
2 DCM       VUS    ACTN2,LAMA2,LDB3,NPPA,TTN                        
3 HCM       P,VUS  DES(P),DSC2(VUS),TTN(VUS)                        
4 HCM       VUS    ANK2(P),BAG3, FKTN,MIB1, RYR2, SLC25A4, TTN,TTNn3

For illustration purposes I focused onyl on some example rows c(1,8,23,32). Then I used a combination of tidyr's separate_rows and separate function. Finally I cleaned the data using mutate.

d %>% 
  as.tibble(rownames = NULL) %>% 
  slice(c(1,8,23,32)) %>% 
  separate_rows(Gene.Name, sep=",") %>% 
  separate(Gene.Name, into = c("Gene", "Res"), sep="[(]") %>% 
  mutate(Gene=str_trim(Gene),
         Res=str_trim(gsub("[)]","",Res))) %>% 
  mutate(Res=ifelse(is.na(Res), as.character(Result), Res))
# A tibble: 17 x 4
   Diagnosis Result Gene    Res  
   <fct>     <fct>  <chr>   <chr>
 1 DCM       VUS    TTN     VUS  
 2 DCM       VUS    ACTN2   VUS  
 3 DCM       VUS    LAMA2   VUS  
 4 DCM       VUS    LDB3    VUS  
 5 DCM       VUS    NPPA    VUS  
 6 DCM       VUS    TTN     VUS  
 7 HCM       P,VUS  DES     P    
 8 HCM       P,VUS  DSC2    VUS  
 9 HCM       P,VUS  TTN     VUS  
10 HCM       VUS    ANK2    P    
11 HCM       VUS    BAG3    VUS  
12 HCM       VUS    FKTN    VUS  
13 HCM       VUS    MIB1    VUS  
14 HCM       VUS    RYR2    VUS  
15 HCM       VUS    SLC25A4 VUS  
16 HCM       VUS    TTN     VUS  
17 HCM       VUS    TTNn3   VUS

and the plot using the complete data.

d %>% 
  as.tibble(rownames = NULL) %>% 
  separate_rows(Gene.Name, sep=",") %>% 
  separate(Gene.Name, into = c("Gene", "Res"), sep="[(]") %>% 
  mutate(Gene=str_trim(Gene),
         Res=str_trim(gsub("[)]","",Res))) %>% 
  mutate(Res=ifelse(is.na(Res), as.character(Result), Res)) %>% 
  filter(Gene != "") %>% 
    group_by(Diagnosis, Gene, Res) %>% 
    summarise(n=n()) %>% 
    ggplot(aes(x=Gene, y=n, fill=Res)) +
     geom_col()+ 
     facet_wrap(~Diagnosis, scales = "free_x") + 
     theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

I get this message when I try to install tidyverse: Warning in install.packages : package ‘tidyverse’ is not available (for R version 3.0.2) but i'm pretty sure my version of R is up to date, any advice? — Ningman, Apr 10 '18 at 09:39
Check the first line of the output of `sessionInfo()`. my version is `R version 3.4.1 (2017-06-30)`. I recomment to use RStudio as well. — Roman, Apr 10 '18 at 09:47

spiting data in data.frame divided by a comma onto a new row

1 Answers1