39

I have a string such as:

"aabbccccdd"

I want to break this string into a vector of substrings of length 2:

"aa" "bb" "cc" "cc" "dd"

GSee
MadSeb

5 Answers

61

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: This is much, much faster

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into individual characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it.
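To make the indexing step visible, here is a small sketch on a shorter string (the input "abcdef" and the names `odd` and `even` are only illustrative, not from the answer). The logical vectors `c(TRUE, FALSE)` and `c(FALSE, TRUE)` are recycled along `sst`, so they pick every other element:

```r
sst <- strsplit("abcdef", "")[[1]]   # "a" "b" "c" "d" "e" "f"

odd  <- sst[c(TRUE, FALSE)]   # positions 1, 3, 5 -> "a" "c" "e"
even <- sst[c(FALSE, TRUE)]   # positions 2, 4, 6 -> "b" "d" "f"

paste0(odd, even)
# [1] "ab" "cd" "ef"
```

This only lines up cleanly when the string has an even number of characters; with an odd length, `paste0` recycles the shorter vector.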

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0
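These numbers depend on the R version and the machine; a later comment on this answer reports g1 coming out ahead. A rough sketch for re-checking on your own setup with base R only, assuming `system.time` is precise enough for a coarse comparison (the repetition count of 10 and the names `t1` and `t2` are arbitrary choices, not from the answer):

```r
text <- paste(rep(paste0(letters, letters), 1000), collapse = "")

g1 <- function(text) {
    substring(text, seq(1, nchar(text) - 1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

# elapsed wall-clock seconds for 10 calls of each function
t1 <- system.time(for (i in 1:10) g1(text))[["elapsed"]]
t2 <- system.time(for (i in 1:10) g2(text))[["elapsed"]]
c(g1 = t1, g2 = t2)
```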
GSee
  • Interesting, didn't know about `substring`. Much nicer since `substr` doesn't take vector args for start/end. – mindless.panda Jul 23 '12 at 20:23
  • 2
    brilliant ! the second version is really really fast ! – MadSeb Jul 24 '12 at 01:54
  • I was wondering if there was something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I use gregexpr at the moment. – jackStinger Jan 07 '13 at 12:32
  • @GSee You might want to re-post the g2 portion of this answer on the question this is a duplicate of: http://stackoverflow.com/questions/2247045/r-chopping-a-string-into-a-vector-of-character-elements/2247574#2247574 – Joe Jul 22 '14 at 20:50
  • Got any tricks to extend the fast version to arbitrary chunk length `n`? – mathematical.coffee Aug 14 '15 at 06:03
  • @mathematical.coffee maybe something like this: `do.call(paste0, lapply(seq_len(n), function(i) { idx <- rep(FALSE, n); idx[i] <- TRUE; sst[idx] }))` but see [my comment](http://stackoverflow.com/questions/11619616/how-to-split-a-string-into-substrings-of-a-given-length/11619681?noredirect=1#comment20974133_14942243) on Matthew's post about paying attention to whether your input is divisible by `n` – GSee Aug 15 '15 at 00:48
  • Double check that result:
        test replications elapsed relative user.self sys.self user.child sys.child
          g1          100   0.262    1.000     0.216    0.044          0         0
          g2          100   0.562    2.145     0.530    0.031          0         0
    – vwvan Sep 28 '20 at 05:13
19

There are two easy possibilities:

s <- "aabbccccdd"
  1. gregexpr and regmatches:

    regmatches(s, gregexpr(".{2}", s))[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
    
  2. strsplit:

    strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
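The two options differ when the string length is not a multiple of 2. As a sketch (the pattern `.{1,2}` is a suggested variation, not part of the answer), the `regmatches` version can be made to keep a trailing single character:

```r
s <- "aabbccccdde"   # odd length: 11 characters

# ".{2}" only matches complete pairs, so the trailing "e" is dropped
regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"

# ".{1,2}" is greedy: it takes two characters when it can, one otherwise
regmatches(s, gregexpr(".{1,2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd" "e"
```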
    
Sven Hohenstein
  • 1
    these possibilities are equivalent for the proposed `s` but what if `s <- "aabbccccdde"`?. I like the second option better – rjss Jan 15 '20 at 20:23
  • 1
    The second option works for any number, e.g., `strsplit(s, "(?<=.{11})", perl = TRUE)[[1]]`, while the first only works for single digits. – Øystein S Jan 07 '22 at 12:17
12
string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

Which gives

[1] "aa" "bb" "cc" "cc" "dd"
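The same start-index idea extends to an arbitrary chunk length. A minimal sketch, assuming a single input string and using a hypothetical helper name `chunk` with a parameter `n` (neither is from the answer). It relies on `substring` being vectorized over its start and stop arguments and on it truncating at the end of the string:

```r
chunk <- function(string, n) {
  num.chars <- nchar(string)
  # nothing to split; also avoids seq(1, 0, by = n), which is an error
  if (num.chars == 0) return(character(0))
  # the indices where each piece starts
  starts <- seq(1, num.chars, by = n)
  # a stop index past the end of the string simply yields a shorter last piece
  substring(string, starts, starts + n - 1)
}

chunk("aabbccccdd", 2)
# [1] "aa" "bb" "cc" "cc" "dd"
chunk("abc", 2)
# [1] "ab" "c"
```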
mindless.panda
2

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld" 

(It is indeed slower than g2, but faster than g1 by about a factor of 7)
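If the `apply` loop is the bottleneck, one possible alternative is to keep the vectorized `paste0` of g2 and pad the character vector up front. This is only a sketch: the name `chunk_fast` is hypothetical, it assumes a single input string, and it has not been benchmarked against `s`:

```r
chunk_fast <- function(x, n) {
  sst <- strsplit(x, "")[[1]]
  if (length(sst) == 0L) return(character(0))
  # pad with empty strings so the length becomes a multiple of n
  pad <- (-length(sst)) %% n
  sst <- c(sst, rep("", pad))
  # parts[[i]] holds the i-th character of every group
  parts <- lapply(seq_len(n), function(i) sst[seq(i, length(sst), by = n)])
  # paste the i-th characters together element-wise, one result per group
  do.call(paste0, parts)
}

chunk_fast("hello world", 2)
# [1] "he" "ll" "o " "wo" "rl" "d"
chunk_fast("hello world", 3)
# [1] "hel" "lo " "wor" "ld"
```

Padding with "" means the last group is simply shorter, which matches the behaviour of `s` above.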

Matthew Lundberg
  • If it's possible to have an odd number of characters, then it seems to me it would be faster to handle that after the fact than to introduce an `apply` loop. I bet this is faster: `out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out` – GSee Feb 18 '13 at 19:44
1

Ugly but works

sequenceString <- "ATGAATAAAG"

J <- 3  # maximum sequence length in file
# full-length pieces of J characters each
sequenceSmallVecStart <-
    substring(sequenceString, seq(1, nchar(sequenceString) - J + 1, J),
              seq(J, nchar(sequenceString), J))
# whatever is left after the last full piece ("" if the length is a multiple of J)
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, nchar(sequenceString), J)) + 1)
sequenceSmallVec <-
    c(sequenceSmallVecStart, sequenceSmallVecEnd)
cat(sequenceSmallVec, sep = "\n")

Gives ATG AAT AAA G

den2042