
I have a regex pattern that looks for multiple words in a text and returns each match together with up to five words that precede it and up to five words that follow it.

The problem is that if the regex matches several terms within this range of words, only the first match is returned. For example, the following regex looks for the words "book" and "page", and the (?:\\w+\\W+){0,5} and (?:\\W+\\w+){0,5} parts before and after the keywords pull in the surrounding words.

The following example only returns a single match:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

stringr::str_extract_all(test_str, pattern = my_regex)


[[1]]
[1] "Made out of wood, a book can contain many pages that"

While I would expect:

[[1]]
[1] "Made out of wood, a **book** can contain many pages that"
[2] "a book can contain many **pages** that are used to transmit"

(Matches highlighted)

I tried to solve this with a positive lookahead assertion, but I could not get it to work as intended. How can I modify my regex?

Rasul89
  • Does this answer your question? [Overlapping matches in R](https://stackoverflow.com/questions/25800042/overlapping-matches-in-r) – blhsing Jun 08 '23 at 09:42
  • From what I see, it is mainly about the lookahead (?=...). I tried, e.g., wrapping the middle part in a lookahead: my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(?=(\\bbook?\\w+|\\bpage?\\w+))\\b(?:\\W+\\w+){0,5}" The result is: [[1]] [1] "Made out of wood, a " "book can contain many " "" So somewhat better, but still not the desired output. – Rasul89 Jun 08 '23 at 09:49
  • Possible alternate approach - use a regex to find the index/position of all matches rather than extract the match directly. Use a second to find the index/position of all word breaks/spaces. Combine those to get start and finish indices for each 'match' plus up to 5 breaks either side. – Paul Stafford Allen Jun 08 '23 at 10:04
  • @Rasul89 You need to enclose the entirety of what you want to capture in a lookahead pattern, not just the middle part. – blhsing Jun 09 '23 at 02:38
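
The index-based approach suggested by Paul Stafford Allen could be sketched roughly as follows. This is only one possible reading of that comment, written in base R: the names `window`, `hits` and `contexts`, and the keyword pattern `books?|pages?`, are assumptions made for illustration and do not come from the thread.

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
window <- 5L  # assumed number of context words on each side

# Start and end character positions of every word in the text.
w <- gregexpr("\\w+", test_str, perl = TRUE)[[1]]
w_start <- as.integer(w)
w_end <- w_start + attr(w, "match.length") - 1L
words <- substring(test_str, w_start, w_end)

# Indices of the words that are keywords (assumed keyword pattern).
hits <- grep("^(?:books?|pages?)$", words, ignore.case = TRUE, perl = TRUE)

# For each keyword, cut the original text from the first to the last word
# of its window, clamping the window at the ends of the text.
contexts <- vapply(hits, function(i) {
  from <- max(1L, i - window)
  to <- min(length(words), i + window)
  substring(test_str, w_start[from], w_end[to])
}, character(1))

contexts
```

Because every keyword is handled independently, overlapping windows and repeated keywords do not interfere with each other in this sketch.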

2 Answers


You could split the regex into several patterns instead of using the alternation operator "|":

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

lr <- list()
lr[1] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+)\\b(?:\\W+\\w+){0,5}"
lr[2] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

sapply(lr, function(x) stringr::str_extract_all(test_str, pattern = x))

[[1]]
[1] "Made out of wood, a book can contain many pages that"

[[2]]
[1] "a book can contain many pages that are used to transmit"
Gerald T
  • Thank you, I also thought about this. But it would only solve the problem for cases where the matches of the different regexes overlap. For cases where the same word (e.g. "book") is mentioned multiple times within the overlapping parts of two matches, it would still exclude the second match. – Rasul89 Jun 08 '23 at 12:39

As shown in this answer, you can enclose what you want to capture in a positive lookahead pattern, and then replace match.length with the capture.length attribute to allow the otherwise zero-length match to actually cover what's captured.
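
As a minimal illustration of that mechanism, the sketch below applies the same trick to a toy string (`"abcd"`, chosen here only for demonstration) to collect every overlapping pair of adjacent letters. The variable names are illustrative, and the trick relies on the capture group starting exactly where the zero-length match starts.

```r
# Toy input: we want every overlapping pair of adjacent letters.
x <- "abcd"

# The lookahead consumes no characters, so the engine can retry at the next
# position; the capture group inside it records the text of interest.
m <- gregexpr("(?=(\\w\\w))", x, perl = TRUE)

# Swap the zero match lengths for the capture lengths so that regmatches()
# extracts the captured text instead of empty strings.
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")
  i
})

pairs <- regmatches(x, m)[[1]]
pairs
```

This should return "ab", "bc" and "cd", whereas a plain `\\w\\w` pattern without the lookahead would only find the non-overlapping "ab" and "cd".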

A secondary problem arises once the pattern is wrapped in a lookahead. The engine tests the assertion at every position, so with a simple quantifier like (?:\\w+\\W+){0,5} each of the up to 5 words before a keyword is a valid starting point, and the same keyword is reported several times with progressively shorter contexts. Since you only want fewer than 5 preceding words when the keyword sits within the first 5 words of a line, require exactly 5 words with (?:\\w+\\W+){5} and add ^(?:\\w+\\W+){0,4} as an alternative. The same idea applies to the "up to" 5 words that follow the keyword, with $ as the anchor:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

my_regex <- "(?i)(?=(\\b(?:(?:\\w+\\W+){5}|^(?:\\w+\\W+){0,4})(?:\\bbooks?|\\bpages?)\\b(?:(?:\\W+\\w+){5}|(?:\\W+\\w+){0,4}$)))"

m <- gregexpr(my_regex, test_str, perl=TRUE)
m <- lapply(m, function(i) {
       attr(i, "match.length") <- attr(i, "capture.length")
       i
     })
regmatches(test_str, m)

Demo: https://ideone.com/wZPdd2
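
To check the case raised in the comments, where the same keyword occurs twice within five words of itself, the pattern can be run on a second sentence. The input `repeat_str` below is a made-up example, not taken from the question, and it deliberately ends in a word character rather than a full stop.

```r
# Hypothetical input in which "book" occurs twice within five words.
repeat_str <- "In the old library I found a book next to another book about rare birds and maps"

# Same pattern as in the answer above.
my_regex <- "(?i)(?=(\\b(?:(?:\\w+\\W+){5}|^(?:\\w+\\W+){0,4})(?:\\bbooks?|\\bpages?)\\b(?:(?:\\W+\\w+){5}|(?:\\W+\\w+){0,4}$)))"

m <- gregexpr(my_regex, repeat_str, perl = TRUE)
# Replace the zero match lengths with the capture lengths, as before.
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")
  i
})

contexts <- regmatches(repeat_str, m)[[1]]
contexts
```

Each occurrence of "book" should get its own context here: one starting at "old" and one starting at "a". Note that the `$` branch as written expects the text to end right after a word, so a keyword with fewer than five following words before a trailing full stop would probably need the pattern to allow punctuation before `$`.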

blhsing