I am trying to match only the indexes of paragraphs that contain "TEST", but my regex also matches the index of a paragraph without it, because "TEST" appears in the next paragraph.

Can you help me and explain how, in general, to match only the first occurrence of a pattern BEFORE some other pattern?

asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv

^(\d+.\d+)(?=.*?TEST)

2 Answers

In this regex the lookahead only steps over characters that are not followed by \d\.\d, so it cannot run past the start of the next index to reach TEST:

^(\d+\.\d+)(?=(?:.(?!\d\.\d))*TEST)

Also the period between the numbers must be escaped if you want it to only match a period instead of being a wildcard.
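
As a rough check, the two patterns can be compared on the sample text from the question. This is only a sketch in Python: the names SAMPLE, NAIVE, BOUNDED and matching_indexes are made up for illustration, the question's pattern is shown with the period already escaped, and the MULTILINE and DOTALL flags are assumed because TEST sits on a later line than the index.

```python
import re

# Sample text from the question.
SAMPLE = """asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv
"""

FLAGS = re.MULTILINE | re.DOTALL  # '^' matches at line starts, '.' matches newlines

# Question's pattern (period escaped): the lookahead may run into later paragraphs.
NAIVE = re.compile(r"^(\d+\.\d+)(?=.*?TEST)", FLAGS)

# This answer's pattern: the lookahead stops before anything shaped like digit.digit.
BOUNDED = re.compile(r"^(\d+\.\d+)(?=(?:.(?!\d\.\d))*TEST)", FLAGS)


def matching_indexes(pattern, text):
    """Return the captured paragraph indexes, in order of appearance."""
    return [m.group(1) for m in pattern.finditer(text)]


print(matching_indexes(NAIVE, SAMPLE))    # ['2.1', '2.2', '2.3']
print(matching_indexes(BOUNDED, SAMPLE))  # ['2.1', '2.3']
```

Note that the inner lookahead (?!\d\.\d) is not anchored to a line start, so a number such as 3.5 inside a paragraph body would also stop the scan.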

EDD

The "paragraphs" of interest can be obtained by matching the following regular expression.

^\d+\.\d+\s(?:(?!^\d+\.\d+\s).)*\bTEST\b(?:(?!^\d+\.\d+\s).)*

with the following flags:

  • g: "global", do not return after the first match
  • m: "multiline", causing '^' and '$' to respectively match the beginning of a line (as opposed to matching the beginning and end of the string)
  • s: "single-line mode", . matches all characters, including line terminators

Demo


The expression can be broken down as follows.

^                # match beginning of a line
\d+\.\d+\s       # match 1+ digits then '.' then 1+ digits then a whitespace 
(?:              # begin a non-capture group
  (?!            # begin a negative lookahead
    ^            # match beginning of a line
    \d+\.\d+\s   # match 1+ digits then '.' then 1+ digits then a whitespace 
  )              # end the negative lookahead
  .              # match any character, including line terminators
)                # end non-capture group
*                # execute the non-capture group 0+ times
\bTEST\b         # match 'TEST' with word breaks on both sides
(?:              # begin a non-capture group
  (?!            # begin a negative lookahead
    ^            # match beginning of a line
    \d+\.\d+\s   # match 1+ digits then '.' then 1+ digits then a whitespace 
  )              # end the negative lookahead
  .              # match any character, including line terminators
)                # end non-capture group
*                # execute the non-capture group 0+ times

The technique of matching one character at a time with a negative lookahead (here (?:(?!^\d+\.\d+\s).)) is called the tempered greedy token solution.
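
The breakdown above maps fairly directly onto a verbose-mode pattern. The following is a minimal Python sketch, not the code behind the demo: the names SAMPLE, PARAGRAPH_WITH_TEST and paragraphs_with_test are hypothetical, and Python's re.MULTILINE and re.DOTALL stand in for the m and s flags (finditer plays the role of g).

```python
import re

# Sample text from the question.
SAMPLE = """asdasdasda
2.1 adasdasdasdasdwvwetwevtwtv
wetvwetv TEST wqrqwvrqw
qwvrqwvqwr
2.2 whergtvwe
wetvwetvwetveatw
evtwet
2.3 eyrnenytunrunert
vqevrerwv TEST aevtawtvwetv
"""

PARAGRAPH_WITH_TEST = re.compile(
    r"""
    ^\d+\.\d+\s              # paragraph index at the start of a line
    (?:(?!^\d+\.\d+\s).)*    # tempered token: any char that does not start a new index
    \bTEST\b                 # the required word
    (?:(?!^\d+\.\d+\s).)*    # rest of the paragraph, up to the next index
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)


def paragraphs_with_test(text):
    """Return the full text of every indexed paragraph containing the word TEST."""
    return [m.group(0) for m in PARAGRAPH_WITH_TEST.finditer(text)]


for paragraph in paragraphs_with_test(SAMPLE):
    print(repr(paragraph))
```

On the sample this yields the paragraphs starting with 2.1 and 2.3; the 2.2 paragraph is skipped because the tempered token cannot cross into 2.3 to find TEST.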


Note that there is quite a bit of duplication in this regular expression. Many regex engines permit the use of subroutines (or subexpressions) to reduce the duplication. With the PCRE engine (which I used at the "Demo" link), for example, you could write

(^\d+\.\d+\s)((?:(?!(?1)).)*)\bTEST\b(?2)

Demo

Here (?1) stands for the expression of capture group 1, ^\d+\.\d+\s, and (?2) stands for the expression of capture group 2, (?:(?!(?1)).)*.

This is perhaps clearer if we use named capture groups.

(?P<float>^\d+\.\d+\s)(?P<beforeTEST>(?:(?!(?P>float)).)*)\bTEST\b(?P>beforeTEST)

Demo

One advantage of the use of subroutines is that it avoids some cut-and-paste copying errors.
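
Python's built-in re module does not support subroutine calls such as (?1) or (?P>name), so there the same de-duplication can be approximated by assembling the pattern from shared string fragments. This is a sketch of that substitute technique, not of PCRE subroutines themselves; INDEX, BODY, PATTERN and indexes_with_test are illustrative names.

```python
import re

INDEX = r"^\d+\.\d+\s"          # paragraph index at the start of a line
BODY = rf"(?:(?!{INDEX}).)*"    # tempered token: stop before the next index

# Each fragment is written once and reused, much like (?1) and (?2) above.
PATTERN = re.compile(
    rf"({INDEX})({BODY})\bTEST\b{BODY}",
    re.MULTILINE | re.DOTALL,
)


def indexes_with_test(text):
    """Return the index of every paragraph that contains the word TEST."""
    return [m.group(1).strip() for m in PATTERN.finditer(text)]


text = "1.1 foo\nbar TEST baz\n1.2 nothing here\n1.3 TEST\n"
print(indexes_with_test(text))  # ['1.1', '1.3']
```

Unlike a true subroutine, the fragment is textually copied into the final pattern, but the source still defines each piece only once.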

Cary Swoveland