I have a vector of DNA sequences in IUPAC notation (https://www.bioinformatics.org/sms/iupac.html). For example, given the following sequence and lookup table:
library(data.table)
seq <- "AATCRVTAA"
iupac <- data.table(code = c("A", "C", "G", "T", "R", "Y", "S", "W", "K", "M", "B", "D", "H", "V", "N"),
                    base = c("A", "C", "G", "T", "AG", "CT", "GC", "AT", "GT", "AC", "CGT", "AGT", "ACT", "ACG", "ACGT"))
Here "R" and "V" are ambiguity codes: "R" represents either "A" or "G", and "V" represents "A", "C", or "G".
How can I generate every concrete sequence that the ambiguous sequence above could represent?
The output for this example sequence would be:
"AATCAATAA"
"AATCACTAA"
"AATCAGTAA"
"AATCGATAA"
"AATCGCTAA"
"AATCGGTAA"
The vector of sequences is quite large, so performance is important. Any help will be greatly appreciated!
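As a baseline to compare faster solutions against, the expansion can be sketched in base R with `expand.grid`. This is only a minimal sketch: `iupac_map` and `expand_iupac` are hypothetical names, not from any package, and the named vector is assumed to carry the same code-to-base mapping as the table above. The number of results is the product of the option counts at each position, so sequences with many ambiguity codes grow quickly.

```r
# Hypothetical lookup: same mapping as the table above, as a named vector.
iupac_map <- c(A = "A", C = "C", G = "G", T = "T",
               R = "AG", Y = "CT", S = "GC", W = "AT", K = "GT", M = "AC",
               B = "CGT", D = "AGT", H = "ACT", V = "ACG", N = "ACGT")

# Hypothetical helper: expand one IUPAC sequence into all concrete sequences.
expand_iupac <- function(seq) {
  # Split the sequence into single IUPAC codes
  codes <- strsplit(seq, "", fixed = TRUE)[[1]]
  # One character vector of candidate bases per position
  choices <- strsplit(unname(iupac_map[codes]), "", fixed = TRUE)
  # expand.grid varies its first factor fastest, so feed the positions in
  # reverse to make the last position vary fastest
  grid <- expand.grid(rev(choices), stringsAsFactors = FALSE)
  # Restore the original position order and paste each row into one string
  cols <- rev(unname(as.list(grid)))
  do.call(paste0, cols)
}

expand_iupac("AATCRVTAA")
```

For a whole vector of sequences, `lapply(seqs, expand_iupac)` would return one character vector per input, where `seqs` stands for the (assumed) input vector.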
This question has already been asked for Python: "How to extend ambiguous dna sequence".