how to get all possible sequences from IUPAC notation using R

Question

I have a vector of DNA sequences with IUPAC notation (https://www.bioinformatics.org/sms/iupac.html). For example, given the sequence, and the notation:

seq <- "AATCRVTAA"
iuapc <- data.table(code = c("A", "C", "G", "T", "R", "Y", "S", "W", "K", "M", "B", "D", "H", "V", "N"),
                base = c("A", "C", "G", "T", "AG", "CT", "GC", "AT", "GT", "AC", "CGT", "AGT", "ACT", "ACG", "ACGT"))

Where "R" and "V" are ambiguous values of DNA nucleotides, and "R" represents either "A" or "G" and "V" represents "A", "C" or "G".

How can I generate all the different combinations of sequences that could be represented by the above ambiguous sequence?

The output for this example sequence would be:

"AATCAATAA"
"AATCACTAA"
"AATCAGTAA"
"AATCGATAA"
"AATCGCTAA"
"AATCGGTAA"

The vector of sequences is quite large, so performance is important. Any help will be greatly appreciated!

This question has already been asked for Python here: how to extend ambiguous dna sequence

Just curious, what does "quite large" mean for you here? – Jon Spring Feb 19 '21 at 09:04 — Jon Spring, Feb 19 '21 at 09:04

score 1 · Answer 1 · answered Feb 19 '21 at 09:28

Leveraging from your earlier question today (https://stackoverflow.com/a/66274136/6851825), here's a kludgey tidyverse/base approach:

library(tidyverse)

tibble(seq) %>%
  separate_rows(seq, sep = '(?<=.)(?=.)') %>%
  left_join(iuapc, by = c("seq" = "code")) %>%
  pull(base) %>%
  str_split("") %>%
  expand.grid(stringsAsFactors = FALSE)

#  Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
#1    A    A    T    C    A    A    T    A    A
#2    A    A    T    C    G    A    T    A    A
#3    A    A    T    C    A    C    T    A    A
#4    A    A    T    C    G    C    T    A    A
#5    A    A    T    C    A    G    T    A    A
#6    A    A    T    C    G    G    T    A    A

s_baldur · Answer 2 · 2021-02-19T10:16:22.710

1

Here is something very raw:

library(data.table)
library(magrittr)

# Convert iuapc$base to list of vectors
iuapc[, base := list(strsplit(base, ''))]
setkey(iuapc, code)


tstrsplit(seq, '') %>% 
  lapply(function(x) iuapc[x, base[[1]]]) %>% 
  do.call(CJ, .) %>% 
  .[, paste(.SD, collapse = ''), by = 1:nrow(.)] %>% 
  .[, V1]

# [1] "AATCAATAA" "AATCACTAA" "AATCAGTAA" "AATCGATAA" "AATCGCTAA" "AATCGGTAA"

edited Feb 19 '21 at 10:16

answered Feb 19 '21 at 09:52

s_baldur

29,441
4
36
69

score 0 · Answer 3 · answered Feb 19 '21 at 10:22

library(stringr)

all.seq.iuapc <- function(seq, dictio_replace){
  seq <- toupper(seq)
  vec <- strsplit(seq, "")[[1]]
  vec2 <- str_replace_all(string = vec, pattern= dictio_replace)
  tmp <- expand.grid(strsplit(vec2, ""), stringsAsFactors = FALSE)
  strings <- apply(tmp, 1, paste0, collapse = "")
  return(strings)
}

dictio_replace= c("A" = "A",
                  "C" = "C",
                  "G" = "G",
                  "T" = "T",
                  "R" = "AG",        
                  "Y" = "CT",
                  "S" = "GC",
                  "W" = "AT",
                  "K" = "GT",
                  "M" = "AC",
                  "B" = "CGT",
                  "D" = "AGT",
                  "H" = "ACT",
                  "V" = "ACG",
                  "N" = "ACGT")

how to get all possible sequences from IUPAC notation using R

3 Answers3