Here is a quick comparison of different methods to do this:
library(stringi)
library(dplyr)
# get some sample data
set.seed(1)
long_string <- stri_paste(stri_rand_lipsum(10000), collapse = " ")
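# pick a random 50-character substring to use as the split pattern
x <- sample(9000:11000, 1)
split_string <- substr(long_string, x, x + 49)
# sanity check: splitting on it should give two pieces
result <- long_string %>% strsplit(., split_string)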
length(unlist(result))
#> [1] 2
# split str at the first literal occurrence of pattern only
substr_fun <- function(str, pattern) {
  idx <- regexpr(pattern, str, fixed = TRUE)
  res1 <- list(c(substr(str, 1, idx - 1),
                 substr(str, idx + attr(idx, "match.length"), nchar(str))))
  return(res1)
}
bench::mark(
strsplit_dplyr = long_string %>% strsplit(., split_string),
strsplit_dplyr_fixed = long_string %>% strsplit(., split_string, fixed = TRUE),
strsplit = strsplit(long_string, split_string),
strsplit_fixed = strsplit(long_string, split_string, fixed = TRUE),
stri_split_fixed = stringi::stri_split_fixed(long_string, split_string),
str_split = stringr::str_split(long_string, stringr::fixed(split_string)),
substr_fun = substr_fun(long_string, split_string)
)
#> # A tibble: 7 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 strsplit_dplyr          131ms  134.8ms      7.44      280B        0
#> 2 strsplit_dplyr_fixed   36.6ms   37.6ms     26.5       280B        0
#> 3 strsplit                133ms  133.8ms      7.40        0B        0
#> 4 strsplit_fixed         35.4ms   37.2ms     26.7         0B        0
#> 5 stri_split_fixed       40.7ms   42.5ms     23.6     6.95KB        0
#> 6 str_split              41.6ms   43.1ms     23.4    35.95KB        0
#> 7 substr_fun             13.6ms   14.8ms     67.1         0B        0
In terms of memory usage, strsplit with the option fixed = TRUE and without the overhead from piping is the best solution. The implementations in stringi and stringr run at a similar speed (slightly slower in the benchmark above), but their memory overhead is even larger than the effect from piping.
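The fixed = TRUE option also changes how the split pattern is interpreted, not just how fast it runs. A minimal sketch, assuming a small made-up string (s is hypothetical and not part of the benchmark data):

```r
# Hypothetical sample string, not taken from the benchmark above
s <- "alpha. beta. gamma"

# Default: the split is treated as a regular expression,
# so "." matches every single character
regex_split <- strsplit(s, ".")[[1]]

# fixed = TRUE: the split is matched literally
fixed_split <- strsplit(s, ".", fixed = TRUE)[[1]]

print(regex_split)  # only empty strings, one per character
print(fixed_split)  # three pieces, split at the literal dots
```

Since the lorem ipsum text contains periods and commas, a literal match is probably what is wanted when splitting on a substring taken from the text itself.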
Update
I added the method from @H 1's answer, along with his approach of taking a 50-character substring to use for splitting. The only change is that I wrapped it in a function and added fixed = TRUE again, since I think it makes more sense in this case.
The new function is the clear winner if you do not want to make more than one split in your string!
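To illustrate the "no more than one split" caveat, here is a small sketch with a made-up input (s is hypothetical) that contains the separator twice. The function is repeated so the snippet runs on its own:

```r
# Same logic as substr_fun in the benchmark, repeated here for a standalone example
substr_fun <- function(str, pattern) {
  idx <- regexpr(pattern, str, fixed = TRUE)
  list(c(substr(str, 1, idx - 1),
         substr(str, idx + attr(idx, "match.length"), nchar(str))))
}

# Hypothetical input with two occurrences of the separator
s <- "a--b--c"

# regexpr() only reports the first match, so the second "--" stays in the result
first_only <- substr_fun(s, "--")[[1]]

# strsplit() splits at every occurrence
all_splits <- strsplit(s, "--", fixed = TRUE)[[1]]

print(first_only)
print(all_splits)
```

In the benchmark, the 50-character pattern occurs only once, which is why both approaches return the same two pieces there.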