I have a 1.8 million character string that I need to split on a 50-character string. The separator appears exactly once, very close to the start of the long string (about 10k characters in).

Using strsplit() throws an error:

long_string %>% strsplit(., fifty_character_string)

# Error: C stack usage  9065064 is too close to the limit

I have tried to solve the specific error with this method, and this question, but no luck so far.

So now I am investigating whether there's a more memory-efficient way to split a very long string into two. I am unlikely to need to do this more than a handful of times, so I am open to hacky methods that just get the job done.

stevec
  • Honestly `str_split` should not be too bad of an implementation, as it should only have to walk down the string once. Hence, it should be linear with the size of the string. – Tim Biegeleisen Apr 30 '19 at 11:30
  • I doubt it has any effect, but did you get the same error without the `%>%` piping? – Dunois Apr 30 '19 at 11:36

2 Answers

Here is a quick comparison of different methods to do this:

library(stringi)
library(dplyr)

# get some sample data
set.seed(1)
long_string <- stri_paste(stri_rand_lipsum(10000), collapse = " ")
x <- sample(9000:11000, 1)
split_string <- substr(long_string, x, x + 49)

result <- long_string %>% strsplit(., split_string)
length(unlist(result))
#> [1] 2

# Split `str` once, at the first literal occurrence of `pattern`
substr_fun <- function(str, pattern) {
  idx <- regexpr(pattern, str, fixed = TRUE)
  res1 <- list(c(substr(str, 1, idx - 1),
                 substr(str, idx + attr(idx, "match.length"), nchar(str))))
  return(res1)
}

bench::mark(
  strsplit_dplyr = long_string %>% strsplit(., split_string),
  strsplit_dplyr_fixed = long_string %>% strsplit(., split_string, fixed = TRUE),
  strsplit = strsplit(long_string, split_string),
  strsplit_fixed = strsplit(long_string, split_string, fixed = TRUE),
  stri_split_fixed = stringi::stri_split_fixed(long_string, split_string),
  str_split = stringr::str_split(long_string, stringr::fixed(split_string)),
  substr_fun = substr_fun(long_string, split_string)
)
#> # A tibble: 7 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 strsplit_dplyr          131ms  134.8ms      7.44      280B        0
#> 2 strsplit_dplyr_fixed   36.6ms   37.6ms     26.5       280B        0
#> 3 strsplit                133ms  133.8ms      7.40        0B        0
#> 4 strsplit_fixed         35.4ms   37.2ms     26.7         0B        0
#> 5 stri_split_fixed       40.7ms   42.5ms     23.6     6.95KB        0
#> 6 str_split              41.6ms   43.1ms     23.4    35.95KB        0
#> 7 substr_fun             13.6ms   14.8ms     67.1         0B        0

In terms of memory usage, strsplit() with fixed = TRUE and without the overhead from piping is the best solution. The implementations in stringi and stringr are slightly slower in this run (about 41 ms versus 35 ms minimum), and they allocate more memory (6.95KB and 35.95KB) than the piped base R calls do (280B).
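
Apart from speed, fixed = TRUE also changes what is matched: the separator is taken literally instead of being interpreted as a regular expression. A minimal sketch of the difference, using made-up toy strings (`txt` and `sep` are illustrative names, not the data from the benchmark above):

```r
# Hypothetical toy input, chosen so the separator contains regex metacharacters
txt <- "alpha(1)beta(1)gamma"
sep <- "(1)"

# fixed = TRUE: "(1)" is matched as the literal three characters
literal <- strsplit(txt, sep, fixed = TRUE)[[1]]

# Default fixed = FALSE: "(1)" is a regex group that matches just "1",
# so the parentheses stay in the pieces
as_regex <- strsplit(txt, sep)[[1]]

print(literal)
print(as_regex)
```

If the 50-character separator is known to be plain text rather than a pattern, fixed = TRUE is probably the safer choice regardless of the timing difference.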

Update

I added the method from @H 1's answer, along with his approach for getting a 50-character substring to split on. The only change is that I wrapped it in a function and again added fixed = TRUE, since I think it makes more sense in this case.

The new function is the clear winner if you do not want to make more than one split in your string!

JBGruber
  • How much overhead is the piping adding here? It appears that `strsplit_dplyr_fixed` (which is one of the two piped versions in your benchmark here, with FIXED = FALSE (?)) is comparable in execution time to strsplit_fixed (the non-piped, FIXED = TRUE version). – Dunois Apr 30 '19 at 12:00
  • execution time but not memory usage. The default of `strsplit` is `fixed = FALSE`. I just corrected this in the code above. I ran `strsplit_dplyr_fixed` with `fixed = TRUE` but missed to update the code when updating the results... – JBGruber Apr 30 '19 at 12:05
  • @JBGruber For the record, I tried `strsplit(.. fixed=TRUE)` and your custom function (substr_fun) and both worked right away. Awesome help – stevec May 01 '19 at 08:21

As the string is only to be split into two an efficient way to approach this would be to use a combination of regexpr() and substr().

# Generate string (10m char) and pattern
set.seed(10)
long_string <- paste0(sample(letters, 1e+7, replace = TRUE), collapse ="")
x <- sample(9000:11000, 1)
fifty_character_string <- substr(long_string, x, x + 49)

# Find index and split
idx <- regexpr(fifty_character_string, long_string)
res1 <- list(c(substr(long_string, 1, idx - 1),
               substr(long_string, idx + attr(idx, "match.length"), nchar(long_string))))
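
The lines above assume the pattern is actually present. regexpr() returns -1 when there is no match, in which case the substr() calls would silently return the wrong pieces. A minimal sketch of a guarded variant follows. The name `split_once` is hypothetical (it is not part of base R or any package), and it assumes the separator should be treated as a literal string:

```r
# split_once() is a hypothetical helper, shown only as a sketch
split_once <- function(str, pattern) {
  # fixed = TRUE: treat `pattern` as a literal string (an assumption here)
  idx <- regexpr(pattern, str, fixed = TRUE)

  # regexpr() reports -1 when the pattern does not occur; return input as-is
  if (as.integer(idx) == -1L) {
    return(str)
  }

  len <- attr(idx, "match.length")
  c(substr(str, 1L, idx - 1L),          # everything before the first match
    substr(str, idx + len, nchar(str))) # everything after the first match
}

parts <- split_once("header---body---tail", "---")
untouched <- split_once("no separator here", "---")

print(parts)
print(untouched)
```

Only the first occurrence is used, so any later copies of the separator stay in the second piece.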
Ritchie Sacramento