How to split a sequence in R into multiple sub-parts

Question

Given seq = "GAGTAGGAGGAG", how can I split this sequence into the subsequences "GAG", "TAG", "GAG", "GAG", i.e. into groups of three characters?

You asked the same question yesterday: `strsplit("GAGTAGGAGGAG", "(?<=.{3})", perl=TRUE)` — Pierre L, Jul 20 '16 at 14:10
This also works: `library(gsubfn); strapplyc(xx, "...")[[1]]` where there are three dots in a row. — G. Grothendieck, Jul 20 '16 at 14:15
I get the expected output if I give the sequence directly, but if I use `sequence = readDNAStringSet("a.fasta")` and then call `strsplit(sequence, "(?<=.{3})", perl=TRUE)`, I get an error. @Pierre Lafortune — shrinirajesh, Jul 20 '16 at 14:16
We do not know what the `readDNAStringSet("a.fasta")` output is. How do you expect us to help with it? — Pierre L, Jul 20 '16 at 14:21
Take any FASTA file, assign it to a variable using `readDNAStringSet`, and then try `strsplit`. — shrinirajesh, Jul 20 '16 at 14:25
I do not have fasta files. Add a small example of the output to the question in the form `dput(head(readDNAStringSet("a.fasta")))` — Pierre L, Jul 20 '16 at 14:27
What do you get when you enter `str(readDNAStringSet("a.fasta"))`? Add it to your question — Pierre L, Jul 20 '16 at 15:19

score 1 · Answer 1 · answered Jul 20 '16 at 14:15

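We can create a function called fixed_split that splits a character string into parts of n characters each. The regular expression is a zero-width lookbehind that matches at any position preceded by n characters. Because strsplit restarts its search on the remaining text after each split, the cuts fall every n characters: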

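# Split `text` into consecutive chunks of `n` characters.
fixed_split <- function(text, n) {
  # Builds e.g. "(?<=.{3})"; lookbehind needs the PCRE engine (perl = TRUE)
  strsplit(text, paste0("(?<=.{", n, "})"), perl = TRUE)
}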

fixed_split("GAGTAGGAGGAG", 3)
[[1]]
[1] "GAG" "TAG" "GAG" "GAG"
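
For comparison, the same split can be sketched without a regular expression by computing the start position of each chunk and calling substring(). The helper name split_every is made up for this illustration and is not part of the answer above; it handles a single string, and the last chunk is simply shorter if the length is not a multiple of n.

```r
# Hypothetical helper (the name is illustrative): split one string into
# chunks of n characters using base R's substring().
split_every <- function(text, n) {
  len <- nchar(text)
  if (len == 0) return(character(0))
  starts <- seq(1, len, by = n)            # 1, 1 + n, 1 + 2n, ...
  # substring() is vectorised over first/last; the final chunk is
  # shorter when nchar(text) is not a multiple of n.
  substring(text, starts, pmin(starts + n - 1, len))
}

split_every("GAGTAGGAGGAG", 3)
# [1] "GAG" "TAG" "GAG" "GAG"
```

Unlike strsplit, this returns a plain character vector rather than a list, which may or may not be what you want downstream.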

Edit

In your comment you say that sequence = "ATGATGATG" does not work, but it returns the expected result here:

sequence <- "ATGATGATG"
strsplit(sequence, "(?<=.{3})", perl = TRUE)
[[1]]
[1] "ATG" "ATG" "ATG"
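
The thread never shows the actual error or the structure of the readDNAStringSet() result, so the following is only a guess at a workaround. It assumes the error arises because strsplit() accepts only character vectors, and that the object read from the FASTA file can be converted with as.character(). The function name split_codons is hypothetical, and the example uses a plain named character vector as a stand-in for the file contents.

```r
# Sketch under stated assumptions: coerce the input to a plain character
# vector before splitting. split_codons is a hypothetical name, and the
# conversion of a Biostrings object via as.character() is assumed, not
# demonstrated here.
split_codons <- function(x, n = 3) {
  chars <- as.character(x)   # strsplit() rejects non-character input
  strsplit(chars, paste0("(?<=.{", n, "})"), perl = TRUE)
}

# Stand-in for sequences read from a file: one list element per sequence.
split_codons(c(seq1 = "ATGATGATG", seq2 = "GAGTAG"))
```

With a real file, the call would presumably be split_codons(readDNAStringSet("a.fasta")), giving one vector of triplets per record.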
