More memory efficient way than strsplit() to split a string into two in R

Question

I have a 1.8 million character string that I need to split on a 50-character string, which appears once, very close to the start (about 10k characters in).

Using strsplit() throws an error:

long_string %>% strsplit(., fifty_character_string)

# Error: C stack usage  9065064 is too close to the limit

I have tried to solve the specific error with this method, and this question, but no luck so far.

So now I am investigating whether there is a more memory-efficient way to split a very long string into two. I am unlikely to need to do this more than a handful of times, so I am open to hacky methods that just get the job done.

Honestly `str_split` should not be too bad of an implementation, as it should only have to walk down the string once. Hence, it should be linear with the size of the string. — Tim Biegeleisen, Apr 30 '19 at 11:30
I doubt it has any effect, but did you get the same error without the `%>%` piping? — Dunois, Apr 30 '19 at 11:36

JBGruber · Accepted Answer · 2019-04-30T14:18:07.880

Here is a quick comparison of different methods to do this:

library(stringi)
library(dplyr)

# get some sample data
set.seed(1)
long_string <- stri_paste(stri_rand_lipsum(10000), collapse = " ")
x <- sample(9000:11000, 1)
split_string <- substr(long_string, x, x + 49)

result <- long_string %>% strsplit(., split_string)
length(unlist(result))
#> [1] 2

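substr_fun <- function(str, pattern) {
  # locate the first literal (non-regex) occurrence of pattern
  idx <- regexpr(pattern, str, fixed = TRUE)
  # keep the text before the match and the text after it
  res1 <- list(c(
    substr(str, 1, idx - 1),
    substr(str, idx + attr(idx, "match.length"), nchar(str))
  ))
  return(res1)
}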

bench::mark(
  strsplit_dplyr = long_string %>% strsplit(., split_string),
  strsplit_dplyr_fixed = long_string %>% strsplit(., split_string, fixed = TRUE),
  strsplit = strsplit(long_string, split_string),
  strsplit_fixed = strsplit(long_string, split_string, fixed = TRUE),
  stri_split_fixed = stringi::stri_split_fixed(long_string, split_string),
  str_split = stringr::str_split(long_string, stringr::fixed(split_string)),
  substr_fun = substr_fun(long_string, split_string)
)
#> # A tibble: 7 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 strsplit_dplyr          131ms  134.8ms      7.44      280B        0
#> 2 strsplit_dplyr_fixed   36.6ms   37.6ms     26.5       280B        0
#> 3 strsplit                133ms  133.8ms      7.40        0B        0
#> 4 strsplit_fixed         35.4ms   37.2ms     26.7         0B        0
#> 5 stri_split_fixed       40.7ms   42.5ms     23.6     6.95KB        0
#> 6 str_split              41.6ms   43.1ms     23.4    35.95KB        0
#> 7 substr_fun             13.6ms   14.8ms     67.1         0B        0

In terms of memory usage, strsplit() with the option fixed = TRUE and without the overhead from piping is the best of the splitting functions. The implementations in stringi and stringr are faster than the default strsplit() (fixed = FALSE) but slightly slower than strsplit() with fixed = TRUE in this benchmark, and their memory overhead is even larger than the effect from piping.
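Besides speed, fixed = TRUE also changes what is matched: with the default fixed = FALSE the split string is interpreted as a regular expression. A small sketch with made-up strings (txt and sep are illustrative names, not from the question) shows the difference:

```r
# Made-up example: the separator is a regex metacharacter
txt <- "a.b.c"
sep <- "."

# fixed = TRUE treats sep as a literal string
literal <- strsplit(txt, sep, fixed = TRUE)[[1]]

# the default fixed = FALSE treats sep as a regex; "." matches any
# character, so every character is consumed as a separator
regex <- strsplit(txt, sep)[[1]]

print(literal)
print(regex)
```

For a 50-character split string taken from arbitrary text, any regex metacharacters it contains would be interpreted the same way unless fixed = TRUE is set.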

Update

I added the method from @H 1's answer, along with his approach to generating a 50-character substring to split on. The only changes are that I wrapped it in a function and added fixed = TRUE, since I think that makes more sense in this case.

The new function is the clear winner (about 15 ms median versus about 37 ms for strsplit() with fixed = TRUE) if you do not want to make more than one split in your string!
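One caveat: regexpr() returns -1 when the pattern is not found, and the function above does not check for that. A minimal sketch of a guarded variant follows. The name split_once and the choice to return the input unchanged when there is no match are assumptions made for illustration, not part of either answer.

```r
# split_once is a hypothetical helper name, not from any package
split_once <- function(str, pattern) {
  idx <- regexpr(pattern, str, fixed = TRUE)
  # regexpr() returns -1 if the pattern does not occur in str
  if (idx == -1L) {
    return(list(str))
  }
  before <- substr(str, 1, idx - 1)
  after <- substr(str, idx + attr(idx, "match.length"), nchar(str))
  list(c(before, after))
}

# splits only at the first occurrence
split_once("abc--def--ghi", "--")
# no match: the input comes back as a single element
split_once("abcdef", "--")
```

Unlike strsplit(), this only ever splits at the first occurrence, which is all the question needs.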

How much overhead is the piping adding here? It appears that `strsplit_dplyr_fixed` (which is one of the two piped versions in your benchmark here, with FIXED = FALSE (?)) is comparable in execution time to strsplit_fixed (the non-piped, FIXED = TRUE version). — Dunois, Apr 30 '19 at 12:00
Execution time, but not memory usage. The default of `strsplit` is `fixed = FALSE`. I just corrected this in the code above. I ran `strsplit_dplyr_fixed` with `fixed = TRUE` but forgot to update the code when updating the results... — JBGruber, Apr 30 '19 at 12:05
@JBGruber For the record, I tried `strsplit(.. fixed=TRUE)` and your custom function (substr_fun) and both worked right away. Awesome help — stevec, May 01 '19 at 08:21

Ritchie Sacramento · Answer 2 · 2019-04-30T13:47:08.983

As the string only needs to be split into two, an efficient approach is to combine regexpr() and substr().

# Generate string (10m char) and pattern
set.seed(10)
long_string <- paste0(sample(letters, 1e+7, replace = TRUE), collapse ="")
x <- sample(9000:11000, 1)
fifty_character_string <- substr(long_string, x, x + 49)

# Find index and split
idx <- regexpr(fifty_character_string, long_string)
res1 <- list(c(substr(long_string, 1, idx-1), substr(long_string, idx + attr(idx, "match.length"), nchar(long_string))))
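A quick sanity check for this approach is that the two pieces, glued back together around the pattern, reproduce the original string. The sketch below uses a much smaller string (1e5 characters) so it runs quickly. The names small_string, pat, parts and rebuilt are introduced only for this check, and fixed = TRUE is added so the pattern is matched literally.

```r
# Smaller sample string so the check runs quickly
set.seed(10)
small_string <- paste0(sample(letters, 1e5, replace = TRUE), collapse = "")
x <- sample(9000:11000, 1)
pat <- substr(small_string, x, x + 49)

# Find the first literal occurrence and cut around it
idx <- regexpr(pat, small_string, fixed = TRUE)
parts <- c(
  substr(small_string, 1, idx - 1),
  substr(small_string, idx + attr(idx, "match.length"), nchar(small_string))
)

# The pieces plus the pattern should rebuild the original string
rebuilt <- paste0(parts[1], pat, parts[2])
identical(rebuilt, small_string)  # should be TRUE
```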
