I set out looking for a vectorised solution to this, in order to avoid
lapply()
ing one of the single string solutions across long vectors. Failing
to find an existing solution, I somehow fell down a rabbit hole of
painstakingly writing one in C. It ended up hilariously complicated compared
to the many one-line R solutions shown here (no thanks to me deciding to also
want to handle Unicode strings to match the R versions), but I thought I’d
share the result, in case it somehow someday helps somebody. Here’s what
eventually became of that:
#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
// Find the width (in bytes) of a UTF-8 character, given its first byte
size_t utf8charw(char b) {
if (b == 0x00) return 0;
if ((b & 0x80) == 0x00) return 1;
if ((b & 0xe0) == 0xc0) return 2;
if ((b & 0xf0) == 0xe0) return 3;
if ((b & 0xf8) == 0xf0) return 4;
return 1; // Really an invalid character, but move on
}
// Find the number of UTF-8 characters in a string
size_t utf8nchar(const char* str) {
size_t nchar = 0;
while (*str != '\0') {
str += utf8charw(*str); nchar++;
}
return nchar;
}
SEXP C_str_chunk(SEXP x, SEXP size_) {
// Allocate a list to store the result
R_xlen_t n = Rf_xlength(x);
SEXP result = PROTECT(Rf_allocVector(VECSXP, n));
int size = Rf_asInteger(size_);
for (R_xlen_t i = 0; i < n; i++) {
const char* str = Rf_translateCharUTF8(STRING_ELT(x, i));
// Figure out number of chunks
size_t nchar = utf8nchar(str);
size_t nchnk = nchar / size + (nchar % size != 0);
SEXP chunks = PROTECT(Rf_allocVector(STRSXP, nchnk));
for (size_t j = 0, nbytes = 0; j < nchnk; j++, str += nbytes) {
// Find size of next chunk in bytes
nbytes = 0;
for (int cp = 0; cp < size; cp++) {
nbytes += utf8charw(str[nbytes]);
}
// Assign to chunks vector as an R string
SET_STRING_ELT(chunks, j, Rf_mkCharLenCE(str, nbytes, CE_UTF8));
}
SET_VECTOR_ELT(result, i, chunks);
}
// Clean up
UNPROTECT(n);
UNPROTECT(1);
return result;
}
I then put this monstrosity into a file called str_chunk.c
, and compiled with R CMD SHLIB str_chunk.c
.
To try it out, we need some set-up on the R side:
str_chunk <- function(x, n) {
.Call("C_str_chunk", x, as.integer(n))
}
# The (currently) accepted answer
str_chunk_one <- function(x, n) {
substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n))
}
dyn.load("str_chunk.dll")
So what we’ve achieved with the C version is to take a vector inputs and return a list:
str_chunk(rep("0123456789AB", 2), 2)
#> [[1]]
#> [1] "01" "23" "45" "67" "89" "AB"
#>
#> [[2]]
#> [1] "01" "23" "45" "67" "89" "AB"
Now off we go with benchmarking.
We start off strong with a 200x improvement for a long(ish) vector of
short strings:
x <- rep("0123456789AB", 1000)
microbenchmark::microbenchmark(
accepted = lapply(x, str_chunk_one, 2),
str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> accepted 229.5826 216.8246 182.5449 203.785 182.3662 25.88823 100
#> str_chunk(x, 2) 1.0000 1.0000 1.0000 1.000 1.0000 1.00000 100
… which then shrinks to a distinctly less impressive 3x improvement for
large strings.
x <- rep(strrep("0123456789AB", 1000), 10)
microbenchmark::microbenchmark(
accepted = lapply(x, str_chunk_one, 2),
str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> accepted 2.77981 2.802641 3.304573 2.787173 2.846268 13.62319 100
#> str_chunk(x, 2) 1.00000 1.000000 1.000000 1.000000 1.000000 1.00000 100
dyn.unload("str_chunk.dll")
So, was it worth it? Well, absolutely not considering how long it took to
actually get working properly – But if this was in a package, it would have
saved quite a lot of time in my use-case (short strings, long vectors).