
I'm trying to match all instances of a percentage (e.g. 20%) AFTER a specific pattern (or in this case a word):

Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. Morbi et 
feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus. 
Mauris arcu dui, vestibulum eget eros eu, eleifend luctus risus.

I want to match the 15% and 20%, but not the 10%. It should determine this by making sure the percentages it's matching occur after the word Discount appears.

This is the pattern I came up with but it seems to match all percentages:

(?<=Discount)*(\d+%)+


This would be using the C# / .NET regex engine.

test

2 Answers


In the pattern (?<=Discount)*(\d+%)+ the * quantifier makes the lookbehind optional, and the lookbehind itself only asserts that the word "Discount" is directly to the left of the current position. Because zero repetitions of the assertion already satisfy the pattern, it matches every occurrence of (\d+%)+

If you only want the value, you don't need a capture group. Note also that (\d+%)+ repeats the whole group, so it matches one or more consecutive runs of 1+ digits followed by %

To get a value only, you could write the pattern like this and use word boundaries to prevent partial word matches:

(?<=\bDiscount\b.*)\b\d+%

The pattern matches:

  • (?<= Positive lookbehind assertion
    • \bDiscount\b.* Match the word "Discount" followed by 0+ times any character except newlines (as there are other characters in between "Discount" and the \d+% pattern)
  • ) Close the lookbehind
  • \b A word boundary
  • \d+% Match 1+ digits followed by %
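
As a quick check of the lookbehind pattern in C#, here is a minimal sketch. The names LookbehindDemo and PercentagesAfterDiscount are made up for illustration, and the sample text is assumed to be a single line, since . does not cross newlines unless RegexOptions.Singleline is set.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: returns every percentage that has the word
    // "Discount" somewhere to its left on the same line.
    public static string[] PercentagesAfterDiscount(string text)
    {
        return Regex.Matches(text, @"(?<=\bDiscount\b.*)\b\d+%")
                    .Cast<Match>()          // Cast keeps this working on .NET Framework too
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Assumption: the question's sample text joined into one line.
        string text = "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. "
                    + "Morbi et feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus.";

        Console.WriteLine(string.Join(", ", PercentagesAfterDiscount(text)));
        // prints "15%, 20%"
    }
}
```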



In .NET you could also use a repeated capture group and read all of its values through the Group.Captures property

\bDiscount\b(?:.*?(\b\d+%))+
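
A minimal sketch of reading the repeated group in C# might look like this. The names CapturesDemo and PercentagesAfterDiscount are hypothetical, and the sample text is again assumed to be a single line. Only the first "Discount" match is inspected here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CapturesDemo
{
    // Hypothetical helper: matches once starting at "Discount" and returns
    // every value that group 1 captured while the outer group repeated.
    public static string[] PercentagesAfterDiscount(string text)
    {
        Match m = Regex.Match(text, @"\bDiscount\b(?:.*?(\b\d+%))+");
        if (!m.Success)
        {
            return new string[0];
        }

        // Groups[1].Value would hold only the last capture;
        // Captures keeps all of them in order.
        return m.Groups[1].Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .ToArray();
    }

    public static void Main()
    {
        // Assumption: the question's sample text joined into one line.
        string text = "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. "
                    + "Morbi et feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus.";

        Console.WriteLine(string.Join(", ", PercentagesAfterDiscount(text)));
        // prints "15%, 20%"
    }
}
```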


The fourth bird

Rather than using a (variable-length) positive lookbehind, as Bird #4 has done, you could use a (variable-length) negative lookahead:

\b\d+%(?!.*\bDiscount\b)



The regular expression can be broken down as follows.

\b          # match a word boundary
\d+%        # match one or more (+) digits (`\d`) followed by '%' 
(?!         # begin a negative lookahead
  .*        # match zero or more (*) characters other than line terminators
  \b        # match a word boundary
  Discount  # match 'Discount'
  \b        # match a word boundary
)           # end the negative lookahead
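
For comparison, the lookahead pattern can be tried in C# with a sketch like the one below. The names LookaheadDemo and PercentagesNotBeforeDiscount are hypothetical, and the sample text is assumed to be a single line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: returns every percentage that is NOT followed
    // by the word "Discount" later on the same line.
    public static string[] PercentagesNotBeforeDiscount(string text)
    {
        return Regex.Matches(text, @"\b\d+%(?!.*\bDiscount\b)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Assumption: the question's sample text joined into one line.
        string text = "Lorem ipsum dolor 10% sit amet, consectetur adipiscing elit. "
                    + "Morbi et feugiat Discount vitae 15% urna. Sed 20% et lorem in dapibus.";

        Console.WriteLine(string.Join(", ", PercentagesNotBeforeDiscount(text)));
        // prints "15%, 20%"
    }
}
```

One behavioral difference to keep in mind: the lookahead only checks that no "Discount" follows, so a text that never contains "Discount" yields every percentage, whereas the lookbehind version yields none. With several occurrences of "Discount", the lookahead keeps only the percentages after the last one.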

Note that .NET (and therefore C#) is one of the relatively few regex engines that support variable-length (positive and negative) lookbehinds. Most mainstream languages have regex engines that support variable-length (positive and negative) lookaheads but not variable-length lookbehinds. That includes PHP, Perl, Python (standard regex engine), R, Ruby and Java. The upshot is that the lookahead solution would be advised if it were thought that the code might be ported from C# to a different language.

I cannot say whether a negative lookahead would tend to be more efficient than a positive lookbehind here.

Cary Swoveland