Regex look ahead matching to greedy backwards

Question

I am trying to match only the indexes of paragraphs that contain "TEST", but my regex also matches a paragraph without it, because "TEST" appears in the next one.

Can you help me and explain how, in general, to match only the first occurrence of a pattern BEFORE some other pattern?

asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv

^(\d+.\d+)(?=.*?TEST)

Can you add the expected output? – SelVazi, Feb 25 '23 at 18:57
2.1 and 2.3. Two matches. – Wojciech Rogman, Feb 25 '23 at 21:04

score 1 · Accepted Answer · answered Feb 25 '23 at 19:08

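The lookahead in this regex only steps over characters that are not followed by \d\.\d, so it cannot run past the start of the next index while searching for TEST (demo). Because TEST can sit on a later line than the index, it needs the multiline and dot-matches-newline flags: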

^(\d+\.\d+)(?=(?:.(?!\d\.\d))*TEST)

Also, the period between the numbers must be escaped if you want it to match only a literal period rather than act as a wildcard.
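
To check the difference outside a regex tester, here is a minimal Python sketch that runs the question's pattern and the pattern above on the sample text. The variable names are only illustrative, and it assumes the multiline and dot-matches-newline flags (re.M and re.S in Python), which both patterns need because TEST sits on a later line than the index.

```python
import re

# Sample text copied from the question.
TEXT = """asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv
"""

# re.M lets ^ match at every line start; re.S lets . cross line breaks.
FLAGS = re.M | re.S

# Pattern from the question: the lookahead may run past the next index.
original = re.compile(r"^(\d+.\d+)(?=.*?TEST)", FLAGS)

# Pattern from this answer: a character is only stepped over if it is
# not followed by something that looks like an index (\d\.\d).
fixed = re.compile(r"^(\d+\.\d+)(?=(?:.(?!\d\.\d))*TEST)", FLAGS)

original_hits = original.findall(TEXT)
fixed_hits = fixed.findall(TEXT)

print(original_hits)  # ['2.1', '2.2', '2.3']
print(fixed_hits)     # ['2.1', '2.3']
```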

EDD

Note that `"2.1"` would not be matched if the line beginning with `"2.1"` were, for example, `"2.1 adas16.4dwv"`. – Cary Swoveland, Feb 26 '23 at 00:03

Cary Swoveland · Answer 2 · 2023-02-26T02:30:11.760

The "paragraphs" of interest can be obtained by matching the following regular expression.

^\d+\.\d+\s(?:(?!^\d+\.\d+\s).)*\bTEST\b(?:(?!^\d+\.\d+\s).)*

with the following flags:

g: "global", do not return after the first match
m: "multiline", causing '^' and '$' to respectively match the beginning of a line (as opposed to matching the beginning and end of the string)
s: "single-line mode", . matches all characters, including line terminators

Demo
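
For readers who want to try this outside the demo, here is a rough Python equivalent. It is a sketch under a few assumptions: re.finditer stands in for the g flag, re.M and re.S for the m and s flags, and the variable names are made up for illustration.

```python
import re

# Sample text copied from the question.
TEXT = """asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv
"""

paragraph = re.compile(
    r"^\d+\.\d+\s(?:(?!^\d+\.\d+\s).)*\bTEST\b(?:(?!^\d+\.\d+\s).)*",
    re.M | re.S,
)

# finditer returns every match, which is what the g flag does in the demo.
paragraphs = [m.group(0) for m in paragraph.finditer(TEXT)]

# The index is the first whitespace-separated token of each paragraph.
indexes = [p.split()[0] for p in paragraphs]
print(indexes)  # ['2.1', '2.3']
```

Each match is the whole paragraph, so the index has to be split off afterwards; wrapping the leading \d+\.\d+ in a capture group would be another way to get it.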

The expression can be broken down as follows.

^                # match beginning of a line
\d+\.\d+\s       # match 1+ digits then '.' then 1+ digits then a whitespace 
(?:              # begin a non-capture group
  (?!            # begin a negative lookahead
    ^            # match beginning of a line
    \d+\.\d+\s   # match 1+ digits then '.' then 1+ digits then a whitespace 
  )              # end the negative lookahead
  .              # match any character, including line terminators
)                # end non-capture group
*                # execute the non-capture group 0+ times
\bTEST\b         # match 'TEST' with word breaks on both sides
(?:              # begin a non-capture group
  (?!            # begin a negative lookahead
    ^            # match beginning of a line
    \d+\.\d+\s   # match 1+ digits then '.' then 1+ digits then a whitespace 
  )              # end the negative lookahead
  .              # match any character, including line terminators
)                # end non-capture group
*                # execute the non-capture group 0+ times

The technique of matching one character at a time with a negative lookahead (here (?:(?!^\d+\.\d+\s).)) is called the tempered greedy token solution.
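
Since the question asks about the general case, the same idea can be wrapped in a small helper. The sketch below is only one possible shape: index_before is a hypothetical name, it returns just the index (via a lookahead) rather than the whole paragraph, and it assumes a paragraph starts wherever a line begins with the index pattern followed by whitespace.

```python
import re


def index_before(index: str, target: str) -> re.Pattern:
    """Hypothetical helper: capture `index` at a line start only if
    `target` occurs before the next line that starts with `index`."""
    boundary = rf"^(?:{index})\s"         # what the start of a paragraph looks like
    tempered = rf"(?:(?!{boundary}).)*?"  # step forward, never onto a boundary
    return re.compile(rf"^({index})\s(?={tempered}{target})", re.M | re.S)


# Variant of the question's text with a number in the middle of a line.
TEXT = """2.1 adas16.4dwv
wetvwetv TEST wqrqwvrqw
2.2 whergtvwe
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv
"""

pattern = index_before(r"\d+\.\d+", r"\bTEST\b")
hits = pattern.findall(TEXT)
print(hits)  # ['2.1', '2.3']
```

Because the boundary is anchored with ^, a number such as 16.4 in the middle of a line does not end the paragraph. On this text the accepted answer's pattern finds only 2.3, which is the case raised in the comment on that answer.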

Note that there is quite a bit of duplication in this regular expression. Many regex engines permit the use of subroutines (or subexpressions) to reduce the duplication. With the PCRE engine (which I used at the "Demo" link), for example, you could write

(^\d+\.\d+\s)((?:(?!(?1)).)*)\bTEST\b(?2)

Demo

Here (?1) is replaced by the expression for capture group 1, ^\d+\.\d+\s, and (?2) is replaced by the expression for capture group 2, (?:(?!(?1)).)*.

This is perhaps clearer with named capture groups.

(?P<float>^\d+\.\d+\s)(?P<beforeTEST>(?:(?!(?P>float)).)*)\bTEST\b(?P>beforeTEST)

Demo

One advantage of the use of subroutines is that it avoids some cut-and-paste copying errors.
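
Python's built-in re module does not accept subroutine calls such as (?1) or (?P>name), so there the duplication has to be removed another way. One option, sketched below with illustrative names (INDEX, BODY), is to build the pattern from string fragments.

```python
import re

# The fragments play the role of the PCRE subroutines (?1) and (?2).
INDEX = r"^\d+\.\d+\s"        # start of a numbered paragraph
BODY = rf"(?:(?!{INDEX}).)*"  # tempered token: stop before the next index

# Expands to the same text as the first regular expression in this answer.
paragraph = re.compile(rf"{INDEX}{BODY}\bTEST\b{BODY}", re.M | re.S)

print(paragraph.pattern)
```

Unlike real subroutines, this is plain text substitution done before the regex is compiled, so the engine still sees the full, duplicated pattern.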
