Efficiently break up a string based on the nth occurrence of a substring using R

Question

Introduction

Given a string in R, is there a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block ends at the nth occurrence of a substring?

Work done with Reproducible Example

Suppose we have several paragraphs of the famous Lorem Ipsum text.

library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "

We would like to break this text into segments at every 3rd occurrence of the word " in" (the leading space is included to distinguish it from words that merely contain "in", such as "min").

I have the following solution with a while loop:

# We wish to break up the string at every
# 3rd occurrence of the word "in"

break.character = " in"
break.occurrence = 3
string.list = list()
i = 1

# initialize string to send into the loop
current.string = my.string

while(length(current.string) > 0){

  # Store the segment that occurs BEFORE the nth occurrence of the break substring
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)

  # Update the next string to examine:
  # it is the current string AFTER the nth occurrence of the break substring
  current.string = str_after_nth(current.string, break.character, break.occurrence)

  i = i + 1
}

We get the desired output in a list, along with a warning (not shown):

> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"

[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...

[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"

Goal

Is it possible to improve this solution by vectorizing (e.g. using apply(), lapply(), mapply(), etc.)? Also, my current solution cuts off the last occurrence of the substring in each block.

The current solution may not work well on extremely long strings (such as DNA sequences where we are looking for blocks with the nth occurrence of a substring of nucleotides).

For this example, is there any reason there is a space before "in" and not after? I.e., are you looking for "in" and words that start with "in"? — Andrew, Apr 04 '19 at 16:26
@Andrew the space after the "in" should also work. I am looking for the substring "in" that occurs on its own. It should not be part of another word. Please see the paragraph after the first segment of code for details. — NM_, Apr 04 '19 at 16:45
I saw the paragraph. Also, do you want the "in" to still be included in your string once it is broken up? If you want it included, do you want it pre-break or post (if it matters)? EDIT: scratch that, looks like you figured it out :) — Andrew, Apr 04 '19 at 16:52
@Andrew : it would be preferred to include the substring at the end of each block. My solution currently cuts it out. — NM_, Apr 04 '19 at 16:54

score 1 · Accepted Answer · answered Apr 04 '19 at 16:27

Try with this:

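# Split on the separator; strsplit() drops the " in " between pieces
text_split = strsplit(my.string, " in ")[[1]]

l = length(text_split)
n = floor(l/3)
# First piece of each block: 1, 4, 7, ... (each block takes 3 pieces)
Seq = seq(1, by = 3, length.out = n)

L = sapply(Seq, function(x){
  paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
# Leftover pieces that do not fill a whole block
if (l > (n*3)){
  L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
}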

The final conditional handles the case where the number of pieces is not divisible by 3. The trailing " in " pasted inside sapply() is there because I don't know what you want to do with the "in" that separates your blocks.
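
To see the split-and-regroup idea on a small input, here is a minimal base R sketch. The helper name split_every_nth, the toy string, and the use of fixed = TRUE are my own assumptions for illustration, not part of the answer above. It generalizes the block size to n and puts the separator back at the end of every block except the last.

```r
# Hypothetical helper (not from the original answer): regroup the pieces
# returned by strsplit() so that each block ends with the nth separator.
split_every_nth <- function(x, sep, n) {
  # fixed = TRUE treats sep as a literal string, not a regular expression
  pieces <- strsplit(x, sep, fixed = TRUE)[[1]]
  # Pieces 1..n belong to block 1, pieces n+1..2n to block 2, and so on
  block_id <- ceiling(seq_along(pieces) / n)
  blocks <- vapply(split(pieces, block_id), paste, character(1), collapse = sep)
  # Every block except the last was cut at a separator, so append it again
  last <- length(blocks)
  blocks[-last] <- paste0(blocks[-last], sep)
  unname(blocks)
}

toy <- "a in b in c in d in e in f in g in h"
split_every_nth(toy, " in ", 3)
# [1] "a in b in c in " "d in e in f in " "g in h"
```

One caveat: strsplit() drops a trailing empty piece, so a string that ends with the separator will lose that final separator in this sketch.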

Andrew · Answer 2 · 2019-04-04T18:10:11.200

Let me know if this does the trick. I will try to make it faster. It keeps the third "in" in each block. If it works I'll annotate it more too.

library(lipsum)
library(stringi)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

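# End position of every " in " match
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
# Every third match marks a break; c(FALSE, FALSE, TRUE) is recycled along the vector
start_of_strings <- c(1, end_of_in[c(FALSE, FALSE, TRUE)])
end_of_strings <- c(end_of_in[c(FALSE, FALSE, TRUE)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]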


stri_sub(my.string, start_of_strings, end_of_strings)
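
The position-based idea can be traced by hand on a short string. The sketch below is a base R stand-in, using gregexpr() and substring() in place of stri_locate_all() and stri_sub(), so it runs without extra packages; the toy string and the variable names are assumptions of mine. Unlike the code above, it cuts right after the full separator, so each block ends with " in ".

```r
toy <- "a in b in c in d in e in f in g in h"
sep <- " in "
n   <- 3

# Start positions of every literal match of sep (fixed = TRUE: no regex)
m <- gregexpr(sep, toy, fixed = TRUE)[[1]]
# Convert start positions to end positions
ends <- as.integer(m) + attr(m, "match.length") - 1L
# Keep the end position of every nth match: these are the cut points
cuts <- ends[seq_along(ends) %% n == 0]

starts <- c(1L, cuts + 1L)
stops  <- c(cuts, nchar(toy))
# Drop an empty trailing block if the string happens to end on a cut point
keep <- starts <= stops
blocks <- substring(toy, starts[keep], stops[keep])
blocks
# [1] "a in b in c in " "d in e in f in " "g in h"
```

For long inputs, the same start and stop vectors can be passed to stri_sub() instead of substring().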

EDIT: actually, use stri_sub from stringi. It will scale much better than substring. See:

my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999

microbenchmark::microbenchmark(
  sol1 = {
    text_split=strsplit(my.string," in ")[[1]]

    l=length(text_split)
    n = floor(l/3)
    Seq = seq(1,by=2,length.out = n)

    L= list()
    L=sapply(Seq, function(x){
      paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
    })
    if (l>(n*3)){
      L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
    }
  },
  sol2 = {
    end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
    start_of_strings <- c(1, end_of_in[c(F, F, T)]) 
    end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
    end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
    stri_sub(my.string, start_of_strings, end_of_strings)
  },
  times = 10
)

Unit: milliseconds
 expr      min        lq      mean    median        uq       max neval
 sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941    10
 sol2  55.4163  56.40759  58.53444  56.86043  57.03707  71.02974    10

This is close to the solution. I believe that there should be 6 blocks of substrings. This solution picks 5 of them (not the last one). — NM_, Apr 04 '19 at 17:10
Ok, @NM_, I think I fixed it. It should be more flexible now too. — Andrew, Apr 04 '19 at 17:30
@NM_ Glad it works! Also, I just added an edit. Be sure to use `stri_sub` (from `stringi`) instead of `substring` if you have a lot of data. Good luck! — Andrew, Apr 04 '19 at 18:21
