Regex: Is there a way to not consume words that are captured?

Question

I am trying to extract up to 3 words before and after a given word using a regex in Python. It works in most cases, but it fails when the given word occurs twice within the 3-word window, as in the snippet below (the given word is "hello").

import re

new_text = "I am going to say hello and hello to him"
re.findall(r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,3})(hello)((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3})", new_text)

Expected Output:

[('going to say ', 'hello', ' and hello to'), ('say hello and ', 'hello', ' to him')]

Actual Output:

[('going to say ', 'hello', ' and hello to')]

From my research, this happens because the regex consumes the text it matches: the second "hello" is already consumed as part of the first match's trailing words, so it cannot start a match of its own. I need to capture the surrounding region because I will be doing additional processing on it.

Any advice on how to proceed will be greatly appreciated (Regex or non-regex).

Thanks!

Please have a look at https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches — Carlos Horn, Mar 09 '22 at 08:45

Cary Swoveland · Accepted Answer · 2022-03-09T18:48:38.840

You can match the following regular expression:

r'(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))'

Demo

There are three capture groups. They contain:

a string of three words preceding 'hello'
the word 'hello'
a string of one to three words following 'hello', as many as possible.

The link shows that there are two matches of the string:

"I am going to say hello and hello to him"

1st match (zero-width, before 'g' in 'going')

Capture group 1: "going to say"
Capture group 2: "hello"
Capture group 3: "and hello to"

2nd match (zero-width, before 's' in 'say')

Capture group 1: "say hello and"
Capture group 2: "hello"
Capture group 3: "to him"
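
The two matches above can be reproduced in Python. This is a minimal sketch that uses re.finditer so the zero-width match positions are visible as well; the variable names are only illustrative.

```python
import re

text = "I am going to say hello and hello to him"
pattern = r'(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))'

# Each match is zero-width (start == end); the text lives in the groups.
results = [(m.start(), m.groups()) for m in re.finditer(pattern, text)]

for start, groups in results:
    print(start, groups)
# 5 ('going to say', 'hello', 'and hello to')
# 14 ('say hello and', 'hello', 'to him')
```

Because the lookahead consumes nothing, the search resumes one character after each match start, which is why the second 'hello' can still be found.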

Note that saving 'hello' to a capture group is really unnecessary because there will not be a match if 'hello' is not present in its required position. Observe also that I have constructed the regular expression in such a way that all capture groups begin and end with a word character (rather than with a space as shown in the question).
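
As a sketch of the first point, the keyword can be left uncaptured and the pattern built for any word and window size. The helper name context_windows and its parameters are hypothetical, not part of the original answer. Like the pattern above, it assumes words are runs of \w characters separated by spaces and requires exactly n words before the keyword.

```python
import re

def context_windows(text, word, n=3):
    """Return (before, after) pairs around each occurrence of `word`.

    Hypothetical helper: requires exactly n words before the keyword
    and captures 1 to n words after it, as in the pattern above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    pattern = (
        rf"(?=\b((?:\w+ +){{{n - 1}}}\w+)"   # exactly n words before
        rf" +{re.escape(word)} +"            # the keyword, not captured
        rf"((?:\w+ +){{0,{n - 1}}}\w+\b))"   # 1 to n words after
    )
    return re.findall(pattern, text)

print(context_windows("I am going to say hello and hello to him", "hello"))
# [('going to say', 'and hello to'), ('say hello and', 'to him')]
```

Words containing apostrophes or hyphens, which the character class in the question allows, would need a different word pattern than \w+.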

The regular expression can be broken down as follows. (Note that I show a space as a character class containing one space, merely to make the space character visible to the reader.)

(?=              # begin positive lookahead
  \b             # match word boundary
  (              # begin capture group 1
    (?:\w+[ ]+)  # match >= 1 word chars followed by >= 1 spaces 
    {2}          # execute preceding non-capture group 2 times
    \w+          # match >= 1 word chars
  )              # end capture group 1
  [ ]+           # match >= 1 spaces
  (hello)        # match literal and save to capture group 2
  [ ]+           # match >= 1 spaces
  (              # begin capture group 3
    (?:\w+[ ]+)  # match >= 1 word chars followed by >= 1 spaces 
    {0,2}        # execute preceding non-capture group 0-2 times
    \w+          # match >= 1 word chars
    \b           # match a word boundary
  )              # end capture group 3
)                # end positive lookahead
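
The breakdown above can be used almost verbatim in Python by compiling with re.VERBOSE, which ignores whitespace and # comments outside character classes. That is why the space has to stay written as [ ]. A minimal sketch, with each quantifier kept on the same line as its group:

```python
import re

pattern = re.compile(r"""
    (?=                   # begin positive lookahead
      \b                  # word boundary
      (                   # begin capture group 1
        (?:\w+[ ]+){2}    # two words, each followed by >= 1 spaces
        \w+               # third word
      )                   # end capture group 1
      [ ]+                # >= 1 spaces
      (hello)             # the literal, saved to capture group 2
      [ ]+                # >= 1 spaces
      (                   # begin capture group 3
        (?:\w+[ ]+){0,2}  # up to two words, each followed by >= 1 spaces
        \w+               # final word
        \b                # word boundary
      )                   # end capture group 3
    )                     # end positive lookahead
""", re.VERBOSE)

matches = pattern.findall("I am going to say hello and hello to him")
print(matches)
# [('going to say', 'hello', 'and hello to'), ('say hello and', 'hello', 'to him')]
```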

Thanks for the detailed explanation! I didn't think of enclosing a capture group inside positive lookahead. — Wei Feng, Mar 10 '22 at 07:26
