This type of question has been asked many times; however, I could not find an answer that fits my needs.

I know some ways of splitting strings in R. If I have a string x <- "AGCAGT" and want to split it into substrings of three characters, I would do this with

substring(x, seq(1, nchar(x)-1, 3), seq(3, nchar(x), 3))

and into substrings of two characters, much faster, with

split <- strsplit(x, "")[[1]]
substrg <- paste0(split[c(TRUE, FALSE)], split[c(FALSE, TRUE)])

As a new user of R, I find it difficult to split a string according to my requirements. If x <- "AGCTG" and I use substring(x, seq(1, nchar(x)-1, 3), seq(3, nchar(x), 3)), I do not get the last two-character substring. I get

"AGC" ""

However, I would like to get something like

"AGC" "TG"

or, if I have x <- "AGCT" and split three characters at a time, I would like to get something like

"AGC" "T"

In short, how can I split a string into substrings of a desired length (2, 3, 4, 5, ..., n) while also retaining the remaining characters when fewer than the desired length are left?

nicola
  • See the output of `seq(3, nchar(x), 3)`, which is the end of the substring and you'll get what the problem is. – nicola Feb 25 '16 at 11:31
  • It appears that [there is an answer](http://stackoverflow.com/a/23262521/1655567) concerned with precisely the same problem. – Konrad Feb 25 '16 at 11:42
  • The marked duplicate does not solve my problem. The answer by `zx8754`, which was later deleted, seems to solve it. – Khawaja Owaise Hussain Feb 25 '16 at 11:52
  • @zx8754 Please consider undeleting the post. As per the OP's concerns, I am reopening the post. – akrun Feb 25 '16 at 14:29
  • @RichardScriven Please check my desired output above. Consider string `AGCGGCCAGCT` and three character split. – Khawaja Owaise Hussain Feb 25 '16 at 15:03
  • @akrun Thanks for reopening the post. Indeed not a duplicate; some rushed to mark it as one. The solution by @zx8754 works perfectly. `x <- "AGCGGCCAGCTGCCTGAA"; mylen <- 5; ss <- strsplit(x, "")[[1]]; v1 <- sapply(split(ss, ceiling(seq_along(ss)/mylen)), paste, collapse = "")` – Khawaja Owaise Hussain Feb 25 '16 at 15:18
  • This is pretty confusing to follow as all solutions are comments. – cory Feb 25 '16 at 16:01
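
To make nicola's hint about the end indices concrete, here is a small sketch using the string from the question. The variable names `starts` and `ends` are only illustrative; the two `seq()` calls are the ones from the original attempt.

```r
x <- "AGCTG"

# The start indices cover both pieces ...
starts <- seq(1, nchar(x) - 1, 3)   # 1 4

# ... but the end indices stop at 3, because 6 is beyond nchar(x) = 5
ends <- seq(3, nchar(x), 3)         # 3

# substring() recycles the single end index, so the second piece is
# substring(x, 4, 3), which is the empty string
substring(x, starts, ends)
```

The two index vectors have different lengths, which is why the trailing "TG" is lost rather than returned as a shorter piece.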

2 Answers

Here is one possible solution in a few simple steps.

x <- "AGCGGCCAGCTGCCTGAA"

# desired length
mylen <- 5

# start indices
start <- seq(1, nchar(x), mylen)

# end indices, capped at the length of the string
end <- pmin(start + mylen - 1, nchar(x))

substring(x, start, end)
# [1] "AGCGG" "CCAGC" "TGCCT" "GAA"
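
If this is needed for several lengths, the same steps could be wrapped in a small function. This is only a sketch: the name `split_fixed` is hypothetical (not from any package), and it assumes a single string as input.

```r
# Hypothetical helper: split one string into pieces of n characters,
# keeping a shorter final piece if the length is not a multiple of n.
split_fixed <- function(x, n) {
  stopifnot(is.character(x), length(x) == 1, n >= 1)
  # seq(1, 0, by = n) would fail, so handle the empty string separately
  if (nchar(x) == 0) return(character(0))
  start <- seq(1, nchar(x), by = n)
  end <- pmin(start + n - 1, nchar(x))
  substring(x, start, end)
}

split_fixed("AGCTG", 3)   # "AGC" "TG"
split_fixed("AGCT", 3)    # "AGC" "T"
```

The two calls use the examples from the question and keep the shorter remainder as the last element.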
cdeterman

This answer is by zx8754. Unfortunately, he deleted it after the question was marked as a duplicate. If he would like to post it as an answer himself, I'll delete my post.

x <- "AGCGGCCAGCTGCCTGAA"
mylen <- 5 
ss <- strsplit(x, "")[[1]]
sapply(split(ss, ceiling(seq_along(ss)/mylen)), paste, collapse = "")
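
One thing to be aware of: `sapply()` over the result of `split()` returns a named vector (the names are the group numbers). A possible variation, shown here only as a sketch, stores the grouping in an illustrative variable `groups`, uses `vapply()` to make the return type explicit, and drops the names with `unname()`.

```r
x <- "AGCGGCCAGCTGCCTGAA"
mylen <- 5
ss <- strsplit(x, "")[[1]]

# Group number for each character: 1 1 1 1 1 2 2 2 2 2 ...
groups <- ceiling(seq_along(ss) / mylen)

# Paste the characters of each group back together
pieces <- vapply(split(ss, groups), paste, character(1), collapse = "")

unname(pieces)   # "AGCGG" "CCAGC" "TGCCT" "GAA"
```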