How to split a string into substrings of a given length?

Question

I have a string such as:

"aabbccccdd"

I want to break this string into a vector of substrings of length 2:

"aa" "bb" "cc" "cc" "dd"

GSee · Accepted Answer · 2012-07-24T00:23:06.320

61

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
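
As a hedged sketch, the same `substring` idea can be extended to an arbitrary chunk length while keeping a short final chunk. The function name `chunk_substring` and its argument `n` are illustrative and not part of the original answer; the sketch assumes a single input string.

```r
# Hypothetical helper (not from the original answer): split one string into
# chunks of n characters using vectorized substring().
chunk_substring <- function(x, n) {
  len <- nchar(x)
  if (len == 0) return(character(0))     # seq() below needs at least one character
  starts <- seq(1, len, by = n)          # 1, n+1, 2n+1, ...
  ends   <- seq(n, len + n - 1, by = n)  # n, 2n, ...; the last end may pass the string end
  substring(x, starts, ends)             # substring() truncates at the end of the string
}

chunk_substring("aabbccccdd", 2)
chunk_substring("abc", 2)
```

Because the last end index is allowed to exceed `nchar(x)`, an odd-length input such as "abc" yields "ab" "c" instead of losing the final character.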

Edit: This is much, much faster

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into single characters. Then it pastes each odd-position character together with the even-position character that follows it.
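
To make the recycling of the logical index visible, here is a small step-by-step sketch of the same two lines. The intermediate names `odd` and `even` are only for illustration.

```r
sst <- strsplit("aabbccccdd", "")[[1]]  # ten single characters

# c(TRUE, FALSE) is recycled along sst, so it keeps positions 1, 3, 5, 7, 9
odd <- sst[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, 8, 10
even <- sst[c(FALSE, TRUE)]

# paste0() is vectorized, so each odd character is joined with its neighbour
out <- paste0(odd, even)
out
#[1] "aa" "bb" "cc" "cc" "dd"
```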

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0

edited Jul 24 '12 at 00:23

answered Jul 23 '12 at 20:05


Interesting, didn't know about `substring`. Much nicer since `substr` doesn't take vector args for start/end. – mindless.panda Jul 23 '12 at 20:23
Brilliant! The second version is really, really fast! – MadSeb Jul 24 '12 at 01:54
I was wondering if there was something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I use gregexpr at the moment. – jackStinger Jan 07 '13 at 12:32
@GSee You might want to re-post the g2 portion of this answer on the question this is a duplicate of: http://stackoverflow.com/questions/2247045/r-chopping-a-string-into-a-vector-of-character-elements/2247574#2247574 – Joe Jul 22 '14 at 20:50
Got any tricks to extend the fast version to arbitrary chunk length `n`? – mathematical.coffee Aug 14 '15 at 06:03
@mathematical.coffee maybe something like this: `do.call(paste0, lapply(seq_len(n), function(i) { idx <- rep(FALSE, n); idx[i] <- TRUE; sst[idx] }))` but see [my comment](http://stackoverflow.com/questions/11619616/how-to-split-a-string-into-substrings-of-a-given-length/11619681?noredirect=1#comment20974133_14942243) on Matthew's post about paying attention to whether your input is divisible by `n` – GSee Aug 15 '15 at 00:48
Double check that result:

#  test replications elapsed relative user.self sys.self user.child sys.child
#    g1          100   0.262    1.000     0.216    0.044          0         0
#    g2          100   0.562    2.145     0.530    0.031          0         0

– vwvan Sep 28 '20 at 05:13
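
The `do.call(paste0, ...)` expression from GSee's comment above can be wrapped into a function, sketched here under some assumptions. The name `chunk_paste` is hypothetical, and padding the character vector with empty strings so that its length becomes a multiple of `n` is an addition of this sketch, not something from the comment.

```r
# Hypothetical wrapper around the expression from the comment above.
chunk_paste <- function(x, n) {
  sst <- strsplit(x, "")[[1]]
  # Assumption of this sketch: pad with "" so every chunk has n slots;
  # the padding vanishes when pasted, leaving a short final chunk.
  pad <- (-length(sst)) %% n
  sst <- c(sst, rep("", pad))
  # For each offset i in 1..n, pick every n-th character starting at i,
  # then paste the n resulting vectors together element-wise.
  do.call(paste0, lapply(seq_len(n), function(i) {
    idx <- rep(FALSE, n)
    idx[i] <- TRUE
    sst[idx]
  }))
}

chunk_paste("aabbccccdd", 2)
chunk_paste("hello world", 3)
```

Whether this is faster than the `substring` version likely depends on the R version and the input, as the timing comment above suggests, so it is worth benchmarking on your own data.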

score 19 · Answer 2 · answered Apr 24 '14 at 07:38

There are two easy possibilities:

s <- "aabbccccdd"

gregexpr and regmatches:

regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"

strsplit:

strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"

Sven Hohenstein

These possibilities are equivalent for the proposed `s`, but what if `s <- "aabbccccdde"`? I like the second option better. – rjss Jan 15 '20 at 20:23
The second option works for any number, e.g., `strsplit(s, "(?<=.{11})", perl = TRUE)[[1]]`, while the first only works for single digits. – Øystein S Jan 07 '22 at 12:17
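
The odd-length case raised in the comments can be checked directly. The sketch below compares both options on an 11-character string. The pattern `.{1,2}` and the `sprintf`-built pattern for a chunk length `n` are illustrative additions, not part of the original answer.

```r
s <- "aabbccccdde"   # 11 characters, so the last chunk is a single "e"

# ".{2}" only matches complete pairs, so the trailing "e" is dropped
pairs_only <- regmatches(s, gregexpr(".{2}", s))[[1]]

# ".{1,2}" is greedy but also accepts a shorter final chunk
with_rest <- regmatches(s, gregexpr(".{1,2}", s))[[1]]

# the lookbehind split keeps whatever remains after the last split point
split_rest <- strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]

# illustrative generalization: build the pattern for a chunk length n
n <- 3
chunks_n <- regmatches(s, gregexpr(sprintf(".{1,%d}", n), s))[[1]]

pairs_only
with_rest
split_rest
chunks_n
```

If a short last chunk should be kept, `.{1,2}` with `regmatches` and the lookbehind `strsplit` give the same result here. If incomplete chunks should be discarded, `.{2}` does that.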

score 12 · Answer 3 · answered Jul 23 '12 at 20:09

string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

Which gives

[1] "aa" "bb" "cc" "cc" "dd"
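
The same loop can be wrapped into a function with a configurable chunk length, sketched here with `vapply` so the return type is always a character vector. The name `chunk_substr` and the parameter `n` are hypothetical, and the sketch assumes a single non-empty input string.

```r
# Hypothetical generalization of the snippet above to chunk length n.
chunk_substr <- function(string, n) {
  # indices where each chunk starts: 1, n+1, 2n+1, ...
  starts <- seq(1, nchar(string), by = n)
  # substr() stops at the end of the string, so the last chunk may be short
  vapply(starts, function(ii) substr(string, ii, ii + n - 1), character(1))
}

chunk_substr("aabbccccdd", 2)
chunk_substr("hello world", 3)
```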

Matthew Lundberg · Answer 4 · 2013-02-18T18:43:50.860

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"

(It is indeed slower than g2, but faster than g1 by about a factor of 7.)

If it's possible to have an odd number of characters, then it seems to me it would be faster to handle that after the fact than to introduce an `apply` loop. I bet this is faster: `out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out` — GSee, Feb 18 '13 at 19:44
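
The one-liner in the comment above can be written out as a function. This is only a sketch: `g2` is repeated from the accepted answer so the block is self-contained, and the wrapper name `g2_trim` is hypothetical.

```r
# g2 as defined in the accepted answer
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

# Hypothetical wrapper: repair the last element for odd-length input
g2_trim <- function(x) {
  out <- g2(x)
  if (nchar(x) %% 2 == 1L) {
    # For odd lengths the last pair contains a recycled character;
    # keep only its first character.
    last <- length(out)
    out[last] <- substring(out[last], 1, 1)
  }
  out
}

g2_trim("abc")
g2_trim("aabbccccdde")
```

For "abc", `g2` alone returns "ab" "cb", and the wrapper trims the second element to "c".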

score 1 · Answer 5 · answered Apr 24 '14 at 07:28

Ugly but works

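sequenceString <- "ATGAATAAAG"

J <- 3  # maximum sequence length in file (chunk size)
n <- nchar(sequenceString)  # assumes n >= J
# full-length chunks: starts at 1, J+1, ... and ends at J, 2J, ...
sequenceSmallVecStart <-
    substring(sequenceString, seq(1, n - J + 1, J), seq(J, n, J))
# leftover characters after the last full chunk ("" if n is a multiple of J)
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, n, J)) + 1)
# drop the empty leftover so no blank element is appended
sequenceSmallVec <-
    c(sequenceSmallVecStart, sequenceSmallVecEnd[nzchar(sequenceSmallVecEnd)])
cat(sequenceSmallVec, sep = "\n")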

Gives ATG AAT AAA G
