Match multiple regex groups starting after a specific word/pattern within the text

Question

I'm trying to match all instances of a percentage (e.g. 20%) AFTER a specific pattern (or in this case a word):

Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et 
feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. 
Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.

I want to match the 15% and 20%, but not the 10%. It should determine this by making sure the percentages it's matching occur after the word Discount appears.

This is the pattern I came up with but it seems to match all percentages:

(?<=Discount)*(\d+%)+

This would be using the C# / .NET regex engine.

One way is like this `(?<=Discount.*)\b\d+%` https://regex101.com/r/IuDSff/1 — The fourth bird, Aug 04 '23 at 20:39
Wow, I didn't even think to put the capture in with the Discount. I gotta say I'm still slightly confused why mine doesn't work. — test, Aug 04 '23 at 20:40
@Thefourthbird Unfortunately, most regexp flavors don't allow variable-length lookbehinds. — Barmar, Aug 04 '23 at 20:40
@Thefourthbird if you throw that in an answer I'll accept it. Would like to understand why the capture outside of the lookbehind makes it fail — test, Aug 04 '23 at 20:42
@test Should the value after "Discount" be in the same line, and what if "Discount" occurs multiple times? — The fourth bird, Aug 04 '23 at 20:43
In my case Discount would only appear once, the %s could be anywhere after that — test, Aug 04 '23 at 20:44
Also you could use a [`\G`](https://www.regular-expressions.info/continue.html) based pattern: [`(?:\G(?!^)|Discount).*?(\d+%)`](https://regex101.com/r/LrOWkY/1) — bobble bubble, Aug 04 '23 at 22:08

The fourth bird · Accepted Answer · 2023-08-04T20:54:16.800

In the pattern (?<=Discount)*(\d+%)+ the * quantifier makes the lookbehind assertion optional. The lookbehind only asserts that the word "Discount" is directly to the left of the current position, and because repeating it 0 times also suffices, the pattern matches every occurrence of (\d+%)+
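To make that concrete, here is a minimal sketch that runs the pattern from the question with and without the `*` on the lookbehind. It assumes the sample text contains line breaks where the question displays them; the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text from the question (line breaks assumed to be as displayed there).
string text =
    "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et \n" +
    "feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. \n" +
    "Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.";

// The quantified lookbehind may repeat zero times, so it never has to succeed.
string[] original = Regex.Matches(text, @"(?<=Discount)*(\d+%)+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Without the quantifier, "Discount" must sit directly before the digits,
// which never happens in this text.
string[] strict = Regex.Matches(text, @"(?<=Discount)(\d+%)+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", original)); // 10%, 15%, 20%
Console.WriteLine(strict.Length);               // 0
```

Neither variant gives the wanted result, which is why the characters between "Discount" and the percentage have to be accounted for inside the lookbehind.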

If you only want the value, you don't need a capture group. Note that (\d+%)+ repeats, 1 or more times, the sequence of 1+ digits followed by %, so the outer + is not needed to match a single percentage.

To get a value only, you could write the pattern like this and use word boundaries to prevent partial word matches:

(?<=\bDiscount\b.*)\b\d+%

The pattern matches:

(?<= Positive lookbehind assertion
- \bDiscount\b.* Match the word "Discount" followed by 0+ times any character except newlines (as there are other characters in between "Discount" and the \d+% pattern)
) Close the lookbehind
\b A word boundary
\d+% Match 1+ times any digit and %

Regex demo
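A minimal sketch of how the lookbehind pattern could be used from C#. The sample text is taken from the question, with line breaks assumed to be as displayed there. The second input (`twoLines`) is a hypothetical string added only to show the effect of `RegexOptions.Singleline`, which may matter if the percentages can appear on later lines than "Discount".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text from the question (line breaks assumed to be as displayed there).
string text =
    "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et \n" +
    "feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. \n" +
    "Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.";

var lookbehind = new Regex(@"(?<=\bDiscount\b.*)\b\d+%");
string[] found = lookbehind.Matches(text)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
Console.WriteLine(string.Join(", ", found)); // 15%, 20%

// Hypothetical input: one percentage on the line after "Discount".
string twoLines = "Discount 5%\nthen 7%";

// By default '.' does not match '\n', so only the same line is searched.
string[] sameLineOnly = lookbehind.Matches(twoLines)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// With Singleline, '.' also matches '\n', so later lines are included.
var acrossLines = new Regex(@"(?<=\bDiscount\b.*)\b\d+%", RegexOptions.Singleline);
string[] anyLine = acrossLines.Matches(twoLines)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", sameLineOnly)); // 5%
Console.WriteLine(string.Join(", ", anyLine));      // 5%, 7%
```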

In .NET you could also use a repeated capture group and read every captured value through the Group.Captures property:

\bDiscount\b(?:.*?(\b\d+%))+

Regex demo
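As a sketch of the `Group.Captures` approach, the following runs a single `Regex.Match` and then reads every value that group 1 captured during the repetition. The sample text is from the question (line breaks assumed to be as displayed there); the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text from the question (line breaks assumed to be as displayed there).
string text =
    "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et \n" +
    "feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. \n" +
    "Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.";

// One overall match that starts at "Discount"; group 1 is captured once per repetition.
Match match = Regex.Match(text, @"\bDiscount\b(?:.*?(\b\d+%))+");

// Groups[1].Value holds only the last capture; Captures keeps all of them.
string[] captured = match.Success
    ? match.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray()
    : new string[0];

Console.WriteLine(string.Join(", ", captured)); // 15%, 20%
Console.WriteLine(match.Groups[1].Value);       // 20%
```

Because `.` does not match a newline by default, this version also stops collecting at the end of the line that contains "Discount".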

Cary Swoveland · Answer 2 · 2023-08-05T00:22:59.917

Rather than using a (variable-length) positive lookbehind, as Bird #4 has done, you could use a (variable-length) negative lookahead:

\b\d+%(?!.*\bDiscount\b)

Demo

The regular expression can be broken down as follows.

\b          # match a word boundary
\d+%        # match one or more (+) digits (`\d`) followed by '%' 
(?!         # begin a negative lookahead
  .*        # match zero or more (*) characters other than line terminators
  \b        # match a word boundary
  Discount  # match 'Discount'
  \b        # match a word boundary
)           # end the negative lookahead
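A minimal C# sketch of the negative lookahead, under two assumptions that are not stated in the answer: the sample text contains line breaks where the question displays them, and `RegexOptions.Singleline` is used so that `.*` can see a "Discount" on a later line. The `noKeyword` string is a hypothetical input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text from the question (line breaks assumed to be as displayed there).
string text =
    "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et \n" +
    "feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. \n" +
    "Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.";

string pattern = @"\b\d+%(?!.*\bDiscount\b)";

// Singleline lets '.' match '\n', so the lookahead can see "Discount" on a later line.
string[] withSingleline = Regex.Matches(text, pattern, RegexOptions.Singleline)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Default options: the lookahead stops at the end of the line, so 10% slips through.
string[] defaultOptions = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Hypothetical input without the keyword: nothing follows, so the percentage matches.
string noKeyword = "Only 30% here";
string[] withoutKeyword = Regex.Matches(noKeyword, pattern, RegexOptions.Singleline)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", withSingleline)); // 15%, 20%
Console.WriteLine(string.Join(", ", defaultOptions)); // 10%, 15%, 20%
Console.WriteLine(string.Join(", ", withoutKeyword)); // 30%
```

The last case shows a behavioural difference from the lookbehind version: "not followed by Discount" is also true when "Discount" never occurs, whereas the lookbehind requires it to be present.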

Note that C# is one of the relatively few languages that support variable-length (positive and negative) lookbehinds. Most mainstream languages have regex engines that support variable-length (positive and negative) lookaheads but not variable-length lookbehinds. That includes PHP, Perl, Python (standard regex engine), R, Ruby and Java. The upshot is that the lookahead solution would be advised if it were thought that the code might be ported from C# to a different language.

I cannot say whether a negative lookahead would tend to be more efficient than a positive lookbehind here.
