I have a regex pattern that looks for multiple words in a text and returns each match together with up to five words that precede it and up to five words that follow it.
The problem is that if several of the search terms fall within this range of words, only the first one is returned as a match.
For example, the following regex essentially looks for the words "book" and "page"; the (?:\\w+\\W+){0,5} part before the alternation and the \\b(?:\\W+\\w+){0,5} part after it pull in the surrounding words.
The following example only returns a single match:
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"
stringr::str_extract_all(test_str, pattern = my_regex)
[[1]]
[1] "Made out of wood, a book can contain many pages that"
While I would expect:
[[1]]
[1] "Made out of wood, a **book** can contain many pages that"
[2] "a book can contain many **pages** that are used to transmit"
(Matches highlighted)
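My understanding of why the second match is missing: regex matches do not overlap, so once the first match has consumed "pages" as a context word, the search resumes after "that" and finds no keyword in what is left. A minimal sketch to check this, assuming base R's PCRE engine (perl = TRUE) treats this pattern the same way as stringr's ICU engine; the names first, rest and keyword_only are only for illustration:

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

# Locate the first match and work out where it ends.
first <- regexpr(my_regex, test_str, perl = TRUE)
match_end <- as.integer(first) + attr(first, "match.length") - 1L

# Text that is still available to the regex engine after the first match.
rest <- substr(test_str, match_end + 1L, nchar(test_str))
rest

# The keyword alternation on its own finds nothing in the remainder,
# because "pages" was already consumed as context of the first match.
keyword_only <- "(?i)\\b(?:book?\\w+|page?\\w+)\\b"
grepl(keyword_only, rest, perl = TRUE)
```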
I tried to solve this by using a positive lookahead assertion but I did not get it to work as I wanted. What can I do to modify my regex?
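One possible workaround, sketched here rather than offered as a definitive fix, is to stop asking a single regex for overlapping matches: locate each keyword on its own first, then cut the context out of the text before and after it. The helper extract_with_context, its arguments, and grab_first are hypothetical names introduced for this sketch, and it uses base R with perl = TRUE instead of stringr.

```r
# Hypothetical helper: return every keyword hit with up to `n` words of
# context on each side. Context windows of different hits may overlap.
extract_with_context <- function(text, keyword_regex, n = 5) {
  # Step 1: find the keywords alone, without any context.
  hits <- gregexpr(keyword_regex, text, perl = TRUE)[[1]]
  if (hits[1] == -1) return(character(0))
  starts <- as.integer(hits)
  ends <- starts + attr(hits, "match.length") - 1L

  # Up to n words at the end of the preceding text / start of the following text.
  left_regex  <- sprintf("(?:\\w+\\W+){0,%d}$", n)
  right_regex <- sprintf("^(?:\\W+\\w+){0,%d}", n)

  # First match of `rx` in `s`, or "" when there is none.
  grab_first <- function(rx, s) {
    m <- regmatches(s, regexpr(rx, s, perl = TRUE))
    if (length(m) == 0) "" else m
  }

  # Step 2: build one context string per keyword hit.
  vapply(seq_along(starts), function(i) {
    before <- substr(text, 1L, starts[i] - 1L)
    after  <- substr(text, ends[i] + 1L, nchar(text))
    paste0(grab_first(left_regex, before),
           substr(text, starts[i], ends[i]),
           grab_first(right_regex, after))
  }, character(1))
}

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
keyword_regex <- "(?i)\\b(?:book?\\w+|page?\\w+)\\b"
result <- extract_with_context(test_str, keyword_regex)
result
```

Because each keyword is handled independently, "book" and "pages" each get their own window even though the windows share words. Wrapping the whole original pattern in a lookahead would instead try a match at every word boundary, which I suspect is why that attempt returned more than the two strings I wanted.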