1

Regex: /^(\d+)[^_]/gm
Test String: 12_34

I'd expect this regex not to match the test string, because \d+ is greedy and consumes the digits 1 and 2, and then [^_] fails on _.

But it unexpectedly matches, with only 1 in Group 1. Where am I wrong?

I'm trying to find a regular expression that matches the digits in the test strings "12" or "12xx" but does not match "12_xx".

Sample: https://regex101.com/r/0QRTjs/1/
Dialect: In the end I'll use Microsoft System.Text.RegularExpressions.

schiv
  • 23
  • 2
  • You should read about [Backtracking](https://stackoverflow.com/q/9011592/8967612) and [Atomic Groups](https://stackoverflow.com/q/14411818/8967612). Basically, "greedy" means "as much as possible **with backtracking allowed**". The behavior that you were expecting can be achieved with an atomic group. – 41686d6564 stands w. Palestine Aug 02 '21 at 14:27
  • As to your particular example, you can just use a negative Lookahead since you probably don't need to include the character after the digits in the match: `^(\d+)(?!\d|_)`. See [this demo](https://regex101.com/r/Lt7Zhg/1). – 41686d6564 stands w. Palestine Aug 02 '21 at 14:32
  • OK, so the feature **Backtracking** leads to my "unexpected behaviour", because the regexp does more than I thought. And I can use **Atomic Groups** to avoid the Backtracking. `^((?>\d*))[^_]` seems to do the trick. Thank you Ahmed! – schiv Aug 02 '21 at 14:39
  • `^((?>\d*))([^_]|$)` to correctly match my digit-only line. – schiv Aug 02 '21 at 14:47
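The lookahead suggested in the comments can be checked against the three sample strings from the question. This is only a minimal sketch in Python's `re` module (the question targets .NET, whose lookahead syntax is the same); the helper name `leading_digits` is made up for illustration.

```python
import re

# Pattern from the comment above: a run of digits that is not
# followed by another digit or by an underscore.
LOOKAHEAD = re.compile(r"^(\d+)(?!\d|_)")


def leading_digits(text):
    """Return the digits captured in group 1, or None if there is no match."""
    match = LOOKAHEAD.search(text)
    return match.group(1) if match else None


for sample in ("12", "12xx", "12_xx"):
    print(sample, "->", leading_digits(sample))
```

The `\d` inside the lookahead is what stops the engine from backtracking to `1`: after giving up the `2`, the next character is a digit, so the lookahead rejects that attempt as well.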

2 Answers

0

\d+ matches one or more digits.
Since you append [^_], the digits must be followed by a character that is not _.
Therefore \d+ cannot match 12, because 12 is followed by _.
The engine then backtracks: \d+ matches only 1, which is captured in group 1, and [^_] matches the following 2.

If you want to match lines that contain only digits, there is a very simple expression:

^(\d+)$
Tranbi
  • 11,407
  • 6
  • 16
  • 33
0

\d+ can give back characters it has already matched if that allows the overall pattern to match. After backtracking, 2 satisfies [^_] and only 1 is captured.

See HERE
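The backtracking described above can be reproduced with the original pattern from the question. This is a minimal sketch in Python's `re` module rather than .NET; the helper name `match_and_group` is hypothetical.

```python
import re

# Original pattern from the question, with the multiline flag.
ORIGINAL = re.compile(r"^(\d+)[^_]", re.MULTILINE)


def match_and_group(text):
    """Return (whole match, group 1), or None if the pattern does not match."""
    match = ORIGINAL.search(text)
    return (match.group(0), match.group(1)) if match else None


# \d+ first takes "12", [^_] fails on "_", so \d+ gives back the "2".
print(match_and_group("12_34"))  # ('12', '1')

# With a non-underscore after the digits, no backtracking is needed.
print(match_and_group("12xx"))   # ('12x', '12')
```

The whole match for `12_34` is still `12`, because `[^_]` consumes the `2` that `\d+` gave back; only the capture group shrinks to `1`.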

You can use a negative lookahead at the start of the match:

/^(?!\d+_)(\d+)/

See HERE
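A quick check of the leading negative lookahead against the sample strings from the question, sketched in Python's `re` module; the helper name `digits_unless_underscore` is an assumption made for this example.

```python
import re

# Reject the line up front if the leading digits are followed by "_",
# then capture the digits.
GUARDED = re.compile(r"^(?!\d+_)(\d+)")


def digits_unless_underscore(text):
    """Return the leading digits, or None when they are followed by '_'."""
    match = GUARDED.search(text)
    return match.group(1) if match else None


for sample in ("12", "12xx", "12_xx"):
    print(sample, "->", digits_unless_underscore(sample))
```

Because the lookahead is evaluated at the start of the line, the capture group itself is free to be greedy and always receives the full run of digits.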

Or you can use an atomic group that disallows backtracking:

/^((?>\d+))(?:[^_]|$)/

See HERE
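Python's `re` module only accepts the `(?>...)` syntax from version 3.11 on, so the sketch below does not use the atomic group itself. It swaps in a different, well-known technique that behaves the same way here: a lookahead captures the digits, and a backreference then consumes exactly that text, leaving nothing to backtrack into. The helper name `digits_atomic_emulated` is hypothetical.

```python
import re

# Emulation of ^((?>\d+))(?:[^_]|$) for engines without atomic groups:
# (?=(\d+)) captures the full digit run without consuming it, and \1 then
# consumes exactly that run. The engine cannot shorten the capture later.
EMULATED = re.compile(r"^(?=(\d+))\1(?:[^_]|$)")


def digits_atomic_emulated(text):
    """Return the full digit run, or None if it is followed by '_'."""
    match = EMULATED.search(text)
    return match.group(1) if match else None


for sample in ("12", "12xx", "12_xx"):
    print(sample, "->", digits_atomic_emulated(sample))
```

In .NET, which the question targets, the atomic group from the answer can be used directly and this workaround is not needed.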

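Or use the possessive quantifier ++, which also disallows backtracking (supported by PCRE and Java, for example, but not by .NET's System.Text.RegularExpressions, where the atomic group above is the equivalent):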

/^(\d++)(?:[^_]|$)/

See HERE

The possessive quantifier is likely the fastest...

dawg
  • 98,345
  • 23
  • 131
  • 206