
How to split a string into elements of fixed length in R is a commonly asked question. Typical answers rely either on substring(x) or on strsplit(x, split = "") followed by paste(y, collapse = ""). For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
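For reference, the substring()-based answer usually looks something like the sketch below. The helper name splitWithSubstring is made up for illustration, and it assumes a single non-empty string.

```r
# Hypothetical helper (name is illustrative): split ONE non-empty string
# into pieces of `size` characters with a vectorised substring() call.
splitWithSubstring <- function(string, size) {
  n <- nchar(string)
  starts <- seq(1, n, by = size)              # first character of each piece
  ends <- pmin(starts + size - 1, n)          # last character, capped at the string end
  substring(string, starts, ends)             # `string` is recycled over starts/ends
}

splitWithSubstring("azertyuiop", 3)
# [1] "aze" "rty" "uio" "p"
```
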

I'm looking for the fastest way possible. After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters and then paste them back into groups of the desired length, with some cleverness.

Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.

splitInParts <- function(string, size) {              # can process a vector of strings; "size" is the length of the desired substrings
    chars <- strsplit(string, "", fixed = TRUE)
    lengths <- nchar(string)
    nFullGroups <- floor(lengths / size)              # the number of complete substrings of the desired size

    # Here we prepare a vector of separators (commas), which we will replace with the characters, except at the positions that have to separate substring groups of length "size". Assumes that the string doesn't contain any commas.
    seps <- Map(rep, ",", lengths + nFullGroups)      # the seps vector is longer than the chars vector, because it also holds the separators (as many as there are full groups)
    indices <- Map(seq, 1, lengths + nFullGroups)     # the positions at which separators will be replaced by the characters
    indices <- lapply(indices, function(x) which(x %% (size + 1) != 0))  # exclude the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)

    temp <- function(x, y, z) {      # a function describing the replacement, because we call it in the Map() call below
        x[y] <- z
        x
    }
    res <- Map(temp, seps, indices, chars)                       # now we have a vector of chars with separators interspersed
    res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE)  # collapse the characters and separators
    strsplit(res, ",", fixed = TRUE)                             # at last, split the strings into elements of the desired length (returned visibly)
}

This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
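For comparison, the two slower alternatives described above could be sketched roughly as follows. The helper names splitViaMatrix and splitViaSplit are invented for illustration, and both assume a single non-empty string rather than a vector of strings.

```r
# Hypothetical helper: put the characters into a matrix with `size` rows,
# then collapse each column back into one substring.
splitViaMatrix <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  # Pad with empty strings so the characters fill a whole number of columns
  padding <- (-length(chars)) %% size
  mat <- matrix(c(chars, rep("", padding)), nrow = size)
  apply(mat, 2, paste, collapse = "")
}

# Hypothetical helper: split() the character vector into groups of `size`,
# then collapse each group.
splitViaSplit <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  groups <- ceiling(seq_along(chars) / size)   # group id of each character
  unname(vapply(split(chars, groups), paste, character(1), collapse = ""))
}

splitViaMatrix("azertyuiop", 3)
# [1] "aze" "rty" "uio" "p"
splitViaSplit("azertyuiop", 3)
# [1] "aze" "rty" "uio" "p"
```

Both call paste() once per group from R code, which is likely why they scale poorly on long strings.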

So if you can find something faster, let me know. If not, well my function may be of some use. :)

jeanlain

3 Answers


Was fun reading the updates, so I benchmarked:

> nchar(mystring)
[1] 260000
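How mystring was built is not shown. To reproduce the timings, a string of the same length can be made as in this sketch; repeating the alphabet is purely an assumption, and the actual content used for the benchmark may have been different.

```r
# Assumption: the original mystring is not shown. This simply builds SOME
# 260,000-character string (the 26 lowercase letters repeated 10,000 times).
mystring <- paste(rep(paste(letters, collapse = ""), 10000), collapse = "")
nchar(mystring)
# [1] 260000
```
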

My idea was nearly the same as @akrun's, as str_extract_all uses the same function under the hood (IIRC):

library(stringr)
tensiSplit <- function(string,size) {
  str_extract_all(string, paste0('.{1,',size,'}'))
}

And the results on my machine:

> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
Tensibai

We can split by specifying a regex lookbehind that matches the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).

splitInParts <- function(string, size){
    pat <- paste0('(?<=.{',size,'})')
    strsplit(string, pat, perl=TRUE)
 }

splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"  

splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"  

splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"

Another approach is to use stri_extract_all_regex from library(stringi).

library(stringi)
splitInParts2 <- function(string, size){
   pat <- paste0('.{1,', size, '}')
   stri_extract_all_regex(string, pat)
 }
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"  

stri_extract_all_regex(str1, '.{1,3}')

data

 str1 <- "azertyuiop"
akrun
  • Thanks. But it appears to be slower. My solution takes 9 secs to complete on 35838 gene sequences of total length 41413966, while your solution takes 31 secs. The other solution I posted above takes 6 secs. – jeanlain Sep 04 '15 at 13:26
  • @jeanlain Can you try with the `stringi` approach. – akrun Sep 04 '15 at 13:33

Alright, there was a faster solution published here (d'oh!)

Simply

strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed = TRUE)

This uses a space as the separator, so the string must not contain spaces. (I didn't think about [[:alnum:]]{}.)
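Wrapped in a function, the idea might look like the sketch below. The name splitWithGsub and the sep argument are made up for illustration. The sketch also swaps [[:alnum:]] for . so that non-alphanumeric characters get grouped too, which is a substitution and not what the one-liner above does.

```r
# Hypothetical wrapper (name and `sep` argument are illustrative).
# `sep` must be a single plain character that never occurs in `string`.
splitWithGsub <- function(string, size, sep = " ") {
  pat <- paste0("(.{", size, "})")            # capture every run of `size` characters
  marked <- gsub(pat, paste0("\\1", sep), string)  # append the separator after each run
  strsplit(marked, sep, fixed = TRUE)         # a trailing separator yields no empty element
}

splitWithGsub("azertyuiop", 3)
# [[1]]
# [1] "aze" "rty" "uio" "p"
```

Like the original function, this accepts a vector of strings and returns a list with one element per input string.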

How can I mark my own question as a duplicate? :(

Community
jeanlain