How to split elements in different columns by different symbol using R?

Question

structure(list(ref = c("A", "C", "A", "G", "C"), alt = c("T，TAA，TAAAA，TAAAAA", 
"G，GC，GCCG", "T", "A G", "G"), chr = c("chr1", "chr1", 
"chr1", "chr1", "chr2"), pos_s = c(2313007, 2456780, 2578901, 
2689511, 18907652), pos_e = c(2313009, 2456784, 2578903, 2689513, 
18907654), format = c("GT:AD", "GT:AD", "GT:AD", "GT:AD", "GT:AD"
), info = c("0/1/2/3/4:296,5,33,29,55", "0/1/2/3:376,22,13,7", 
"0/1:323,24", "0/1:288,21", "0/1:3342,25")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))


Here is a data frame in which the alt and info columns contain values separated by different symbols: commas, slashes, and colons. I want to split these values into separate rows while keeping the remaining columns. The difficulty is that the info column mixes several kinds of data and several kinds of separators.

Here is the output I expect:

structure(list(ref = c("A", "A", "A", "A", "C", "C", "C", "A", 
"G", "C"), alt = c("T", "TAA", "TAAAA", "TAAAAA", "G", "GC", 
"GCCG", "T", "A G", "G"), chr = c("chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr2"), pos_s = c(2313007, 
2313007, 2313007, 2313007, 2456780, 2456780, 2456780, 2578901, 
2689511, 18907652), pos_e = c(2313009, 2313009, 2313009, 2313009, 
2456784, 2456784, 2456784, 2578903, 2689513, 18907654), format = c("GT:AD", 
"GT:AD", "GT:AD", "GT:AD", "GT:AD", "GT:AD", "GT:AD", "GT:AD", 
"GT:AD", "GT:AD"), info = c("0/1:296,5", "0/2:296,33", "0/3:296,29", 
"0/4:296,55", "0/1:376,22", "0/2:376,13", "0/3:376,7", "0/1:323,24", 
"0/1:288,21", "0/1:3342,25")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))


I've tried multiple variations of this, but none of them seem to work. Any ideas?

Does this answer your question? [Split data frame string column into multiple columns](https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns) — Sam Rogers, Apr 05 '23 at 05:51
Thank you for your answer, but it probably doesn't fit my question, because the info column in my data is more complicated than that. — Pokemoon, Apr 05 '23 at 06:17

Werner · Accepted Answer · 2023-04-06T20:53:56.663

The code below achieves what you want. It relies mainly on the following functions and considerations:

separate_longer_delim splits a delimited entry into multiple rows, duplicating the other columns;
str_extract_all with the pattern \\d+ extracts every run of digits;
understanding how the info string is constructed makes it possible to pick out the correct digits.
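
As a quick illustration of the first two points, here is a minimal sketch on toy data. The names toy, long, and digits are made up for this example, and separate_longer_delim assumes tidyr 1.3.0 or later.

```r
library(tidyr)
library(stringr)

# Toy data (hypothetical): one delimited column plus an id column
toy <- tibble::tibble(id = c(1, 2), x = c("a,b", "c"))

# Each comma-separated piece of x gets its own row; id is duplicated
long <- separate_longer_delim(toy, x, delim = ",")
long

# Every run of digits in an info-style string, in order of appearance
digits <- str_extract_all("0/1:323,24", "\\d+")[[1]]
digits
```

Because str_extract_all returns the digit runs in order, the position of each number in the result depends only on how many genotype indices come before the colon.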

You'll have to extend the switch that builds strippedinfo if your info string contains more than four / separators.

library(tidyverse)

d <- structure(
  list(
    ref = c("A", "C", "A", "G", "C"), 
    alt = c("T, TAA, TAAAA, TAAAAA", "G, GC, GCCG", "T", "A.G", "G"), 
    chr = c("chr1", "chr1", "chr1", "chr1", "chr2"), 
    pos_s = c(2313007, 2456780, 2578901, 2689511, 18907652), 
    pos_e = c(2313009, 2456784, 2578903, 2689513, 18907654), 
    format = c("GT:AD", "GT:AD", "GT:AD", "GT:AD", "GT:AD"), 
    info = c("0/1/2/3/4:296,5,33,29,55", "0/1/2/3:376,22,13,7","0/1:323,24", "0/1:288,21", "0/1:3342,25")
  ), 
  class = c("tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -5L)
)

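# Rewrite one info string (commas already replaced by ';') as a comma-separated
# list of per-allele entries, e.g. '0/1/2:10;3;4' -> '0/1:10;3,0/2:10;4'.
stripinfo <- function (infostring) {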
  digext <- str_extract_all(
    string = infostring,
    pattern = '\\d+'
  )[[1]]
  
  strippedinfo <- switch(
    EXPR = str_count(
      string = infostring,
      pattern = '/'
    ),
    infostring, # Only 1 '/' found
    paste(
      paste0('0/1:', digext[4], ';', digext[5]),
      paste0('0/2:', digext[4], ';', digext[6]),
      sep = ','
    ), # Two '/' found
    paste(
      paste0('0/1:', digext[5], ';', digext[6]),
      paste0('0/2:', digext[5], ';', digext[7]),
      paste0('0/3:', digext[5], ';', digext[8]),
      sep = ','
    ), # Three '/' found
    paste(
      paste0('0/1:', digext[6], ';', digext[7]),
      paste0('0/2:', digext[6], ';', digext[8]),
      paste0('0/3:', digext[6], ';', digext[9]),
      paste0('0/4:', digext[6], ';', digext[10]),
      sep = ','
    ), # Four '/' found
    paste(infostring, '!') # Add more here...
  )
  
  return(strippedinfo)
  
}

d

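# Protect the commas inside info as ';', expand each info string with stripinfo,
# split alt and info together into rows, then restore the ','.
d %>% mutate(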
  info = str_replace_all(
    string = info,
    pattern = ',',
    replacement = ';'
  ) %>% map_chr(
    .f = stripinfo
  )
) %>% separate_longer_delim(
  cols = c(
    alt, info
  ),
  delim = stringr::regex('\\s*,\\s*')
) %>% mutate(
  info = str_replace(
    string = info,
    pattern = ';',
    replacement = ','
  )
)

Output:

# A tibble: 10 x 7
   ref   alt    chr      pos_s    pos_e format info       
   <chr> <chr>  <chr>    <dbl>    <dbl> <chr>  <chr>      
 1 A     T      chr1   2313007  2313009 GT:AD  0/1:296,5  
 2 A     TAA    chr1   2313007  2313009 GT:AD  0/2:296,33 
 3 A     TAAAA  chr1   2313007  2313009 GT:AD  0/3:296,29 
 4 A     TAAAAA chr1   2313007  2313009 GT:AD  0/4:296,55 
 5 C     G      chr1   2456780  2456784 GT:AD  0/1:376,22 
 6 C     GC     chr1   2456780  2456784 GT:AD  0/2:376,13 
 7 C     GCCG   chr1   2456780  2456784 GT:AD  0/3:376,7  
 8 A     T      chr1   2578901  2578903 GT:AD  0/1:323,24 
 9 G     A.G    chr1   2689511  2689513 GT:AD  0/1:288,21 
10 C     G      chr2  18907652 18907654 GT:AD  0/1:3342,25
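
If the number of / separators is not known in advance, the hand-written switch could be replaced by something more general. The following is only a sketch, not part of the original answer: expand_info is a hypothetical helper that uses base R strsplit, and it assumes every info string has the form "gt0/gt1/...:ad0,ad1,..." with exactly one count per genotype index. The toy data reuses two rows from the example above.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: pair the first genotype index and first count
# with each of the remaining indices and counts.
expand_info <- function(infostring) {
  parts <- strsplit(infostring, ":", fixed = TRUE)[[1]]
  gt <- strsplit(parts[1], "/", fixed = TRUE)[[1]]  # e.g. "0" "1" "2"
  ad <- strsplit(parts[2], ",", fixed = TRUE)[[1]]  # e.g. "376" "22" "13"
  paste0(gt[1], "/", gt[-1], ":", ad[1], ",", ad[-1])
}

# Toy data (two rows taken from the example above)
toy <- tibble::tibble(
  ref  = c("A", "C"),
  alt  = c("T, TAA, TAAAA, TAAAAA", "G"),
  info = c("0/1/2/3/4:296,5,33,29,55", "0/1:3342,25")
)

# Turn alt and info into list columns, then unnest them side by side
result <- toy %>%
  mutate(
    alt  = strsplit(alt, "\\s*,\\s*", perl = TRUE),
    info = lapply(info, expand_info)
  ) %>%
  unnest(c(alt, info))

result
```

Unnesting both list columns in one call keeps alt and info aligned row by row, and it should fail with an error rather than silently misalign if a row has a different number of alt alleles and info entries.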

Thank you very much for your help. This solution has given me a lot of ideas. May I ask how to remove duplicates from the data you processed? The result obtained by doing this has more rows than I expected. Thank you again for your help. — Pokemoon, Apr 06 '23 at 12:39
@Pokemoon: My apologies; I've corrected the pipe to combine the row separation rather than do them separately. See the updated answer. — Werner, Apr 06 '23 at 20:54
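
The extra rows mentioned in these comments are what one would expect when alt and info are separated in two independent steps, since each step multiplies the rows produced by the other. A minimal sketch on made-up toy data (the variable names are hypothetical, and this is presumably, not certainly, what the earlier version of the answer did):

```r
library(tidyr)

# Toy row (hypothetical): two alt alleles and two matching info entries
toy <- tibble::tibble(alt = "T,TAA", info = "0/1:296;5,0/2:296;33")

# Splitting both columns in one call pairs the pieces position by position
together <- separate_longer_delim(toy, c(alt, info), delim = ",")
nrow(together)    # 2

# Splitting one column after the other yields every alt/info combination
step1 <- separate_longer_delim(toy, alt, delim = ",")
one_by_one <- separate_longer_delim(step1, info, delim = ",")
nrow(one_by_one)  # 4
```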
