This seems like a really simple task, but I can't find a good solution in base R. I have a character string with 2N characters. How do I split this into a character vector of length N, with each element being a 2-character string?

I could use something like substr with Vectorize:

vss <- Vectorize(substr, c("start", "stop"))
ch <- paste(rep("a", 1e6), collapse="")
vss(ch, seq(1, nchar(ch), by=2), seq(2, nchar(ch), by=2))

but this is really slow for long strings (O(N^2) I believe).

Hong Ooi
  • Use `substring` which is vectorized also in `first` and `last`. – nicola Mar 26 '16 at 06:46
  • That works, but it has the same problem as `Vectorize(substr)` does, i.e. O(N^2) runtime. It also makes N/2 copies of the initial string, so it's O(N^2) memory as well! – Hong Ooi Mar 26 '16 at 06:58
  • If the chars in your string are ASCII (or at least you don't have multibyte chars), you could try `apply(matrix(charToRaw(ch),nrow=2),2,rawToChar)` which appears to be much faster than `substring`, and scale mostly linearly. – nicola Mar 26 '16 at 06:59
  • GSee's answer runs very fast: http://stackoverflow.com/questions/2247045/chopping-a-string-into-a-vector-of-fixed-width-character-elements – user20650 Mar 26 '16 at 14:46
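
The raw-byte idea from nicola's comment can be wrapped up as a small helper. This is only a sketch: the name `split_pairs_raw` and the `n` parameter are made up for illustration, and it assumes every character is a single byte (plain ASCII) and that the string length is a multiple of `n`.

```r
# Hypothetical helper around the charToRaw/matrix idea from the comments.
# Assumes single-byte characters and nchar(s) divisible by n.
split_pairs_raw <- function(s, n = 2L) {
  bytes <- charToRaw(s)                    # one raw byte per ASCII character
  stopifnot(length(bytes) %% n == 0L)      # refuse ragged input
  # Each matrix column holds n consecutive bytes; convert each column back to a string
  apply(matrix(bytes, nrow = n), 2L, rawToChar)
}

ch_small <- paste(rep("ab", 5), collapse = "")
split_pairs_raw(ch_small)
```

Because the string is converted to bytes once rather than copied per substring, this avoids the repeated copying that `substring` was criticised for above, but it will split multibyte characters in the middle.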

1 Answer

If you want speed, Rcpp is usually a good choice:

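library(Rcpp);
cppFunction('
    // Split every string in v into consecutive chunks of N bytes.
    // The last chunk of a string is shorter if its length is not a multiple of N.
    List strsplitN(std::vector<std::string> v, int N ) {
        if (N < 1) throw std::invalid_argument("N must be >= 1.");
        List res(v.size());
        for (int i = 0; i < v.size(); ++i) {
            // number of chunks = ceiling(length / N)
            int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
            // preallocate all chunks for this string before filling them
            std::vector<std::string> resCur(num,std::string(N,0));
            for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
            res[i] = resCur;
        }
        return res;
    }
');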

ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
##    user  system elapsed
##   0.109   0.015   0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000

Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.


More demos:

strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.

There are two important caveats with the above implementation:

1: It doesn't handle NAs correctly. Rcpp seems to stringify to 'NA' when it's forced to come up with a std::string. You can easily solve this on the R side with a wrapper that replaces the offending list components with a true NA.

x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##
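
The wrapper mentioned in caveat 1 might look like the sketch below. `split_keep_na` and `r_splitter` are hypothetical names; `r_splitter` is just a slow pure-R stand-in so the example runs without compiling anything, and in practice `strsplitN` would be passed as `splitter`.

```r
# Hypothetical wrapper: `splitter` is any function with the shape of strsplitN(x, N)
split_keep_na <- function(x, N, splitter) {
  res <- vector("list", length(x))
  ok <- !is.na(x)
  res[ok] <- splitter(x[ok], N)        # only non-missing strings reach the splitter
  res[!ok] <- list(NA_character_)      # missing inputs stay a true NA
  res
}

# Pure-R stand-in for strsplitN, for illustration only (not fast)
r_splitter <- function(x, N) {
  lapply(x, function(s) {
    starts <- seq(1L, max(nchar(s), 1L), by = N)
    substring(s, starts, starts + N - 1L)
  })
}

out <- split_keep_na(c("abcd", NA, "efg"), 2L, r_splitter)
```

With the compiled function available, `split_keep_na(x, 1L, strsplitN)` should give the same kind of result as the `ifelse` call above, without ever handing an NA to the C++ code.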

2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.

strsplitN('aΩ',1L);
## [[1]]
## [1] "a"    "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
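
If multibyte input matters more than raw speed, one character-aware alternative stays in base R and builds on the `strsplit(x, '')` call shown above. This is a sketch rather than a drop-in replacement: `split_chars` is a made-up name, and it will likely be slower than the Rcpp version because it regroups single characters in R.

```r
# Hypothetical base-R helper: split into chunks of N characters (not bytes)
split_chars <- function(x, N) {
  stopifnot(N >= 1L)
  lapply(strsplit(x, ""), function(chars) {
    if (length(chars) == 0L) return(character())   # empty string: no chunks
    if (anyNA(chars)) return(NA_character_)        # keep NA as a true NA
    grp <- (seq_along(chars) - 1L) %/% N           # chunk index for each character
    unname(vapply(split(chars, grp), paste, character(1), collapse = ""))
  })
}

split_chars("a\u03a9bc", 2L)[[1L]]   # "aΩ" "bc"
```

The string is written with a `\u` escape so the example does not depend on the encoding of the source file.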
bgoldst
  • Thank you very much for this answer. Would you mind if I make a package on GitHub using this function? Naturally, I'll credit you in the DESCRIPTION. I need this in a script that will run several times a day. A package would save me compiling. – BerriJ Feb 16 '22 at 20:41
  • You're very welcome. Sure, go right ahead. – bgoldst Feb 16 '22 at 21:33
  • The package can be found here: https://github.com/BerriJ/strsplit.fix It really just contains your function and no docs (yet). Maybe it's also useful for someone else :) – BerriJ Feb 16 '22 at 22:37