3

I have a question about regex. I don't know why I cannot do the following.

Sample sentence:

"This is a test string with five t's"

The regex I use:

^(.*?(?=t)){3}

I want the regex to match the following.

"This is a test s"

But it doesn't work. Does anyone know why?

Right leg
Cam
  • `*` matches *zero or more* of the previous token, you want `+` instead (for *one or more*) – CertainPerformance Oct 25 '18 at 09:34
  • Did my answer help you? – Right leg Oct 25 '18 at 09:51
  • The point here is that the whole `.*?(?=t)` group pattern can match an empty string. It stops before the first `t` and cannot "hop thru" because it remains where it is when the lookahead pattern matches. You cannot do it like this, you must consume (and move the regex index) at least one char. An alternative solution for this concrete case is `^(?:[^t]*t){2}[^t]*`. Or, a general case solution: `^(?:.*?t){2}(?:(?!t).)*` – Wiktor Stribiżew Oct 25 '18 at 10:18
  • Thanks very much, guys. I read your answers when you posted them but did not reply right away. Because I'm a newbie to regex, I don't know much about the theory, e.g. the regex index or how each matching step proceeds. So I spent 2 days figuring out what you said, with trial and error on your answers. Now I seem to understand some of it, but not deeply. After lots of tests, I think the answer of @Wiktor Stribiżew is better. Thanks very much, it helped me a lot to advance my regex skills. – Cam Oct 27 '18 at 14:14
  • If you do not understand any specific part here, just drop a comment below my answer. – Wiktor Stribiżew Oct 27 '18 at 14:18
  • By the way, do you have any resources where I can learn more about the theory of regex? Things like what Wiktor Stribiżew mentioned: the "regex index" and how each matching step is carried out. – Cam Oct 27 '18 at 14:20

2 Answers

1

The point here is that the whole .*?(?=t) group pattern can match an empty string. It stops before the first t and cannot "hop thru" because it remains where it is when the lookahead pattern (a non-consuming pattern) matches.

You cannot do it like this: each repetition must consume at least one char (and thereby move the regex index forward).
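To see the empty repetitions concretely, here is a minimal sketch using Python's re module (the flavor is an assumption, since the question does not name one). It runs the original pattern and inspects where the last repetition of the group ended up.

```python
import re

s = "This is a test string with five t's"

# Original pattern from the question: the group may match the empty string.
m = re.match(r'^(.*?(?=t)){3}', s)

# The first repetition expands lazily until the lookahead sees the first "t".
# The second and third repetitions start right before that same "t", so the
# lookahead succeeds immediately and they consume nothing.
print(repr(m.group()))  # 'This is a '
print(m.span(1))        # (10, 10): the last repetition is empty
```

The overall match therefore ends before the first t instead of before the third one.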

An alternative solution for this concrete case is

^(?:[^t]*t){2}[^t]*

See the regex demo. The ^(?:[^t]*t){2}[^t]* pattern matches the start of string (^), then consumes two occurrences ({2}) of zero or more chars other than t ([^t]*) followed by t, and then consumes zero or more chars other than t ([^t]*), stopping right before the third t.
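As a quick check of this pattern, here is a small sketch in Python (an assumed flavor; the variable names are illustrative only):

```python
import re

s = "This is a test string with five t's"

# Two repetitions of "non-t chars, then a t", followed by a run of non-t chars.
pattern = re.compile(r'^(?:[^t]*t){2}[^t]*')

m = pattern.match(s)
print(m.group())  # This is a test s

# A string with fewer than two t's cannot satisfy the {2} part at all.
print(pattern.match("abc"))  # None
```

Because each repetition has to consume a literal t, the regex index always moves forward, which is what the original lookahead version was missing.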

Or, a general case solution (if t is a multicharacter string):

^(?:.*?t){2}(?:(?!t).)*

See another regex demo. The (?:.*?t){2} pattern matches two occurrences of any 0+ chars, as few as possible, up to the first t, and then (?:(?!t).)* matches any char, 0+ occurrences, that does not start a t char sequence.
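The general pattern can be built for any delimiter and count. The sketch below assumes Python's re module; the helper name text_before_nth and its parameters are hypothetical, introduced only for illustration.

```python
import re


def text_before_nth(text, delim, n):
    """Return the prefix of `text` that ends right before the n-th `delim`.

    Hypothetical helper: builds ^(?:.*?DELIM){n-1}(?:(?!DELIM).)* from its
    arguments. Returns None when `text` holds fewer than n-1 delimiters.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    d = re.escape(delim)  # treat the delimiter as literal text
    pattern = r'^(?:.*?{d}){{{k}}}(?:(?!{d}).)*'.format(d=d, k=n - 1)
    m = re.match(pattern, text)
    return m.group() if m else None


print(text_before_nth("This is a test string with five t's", "t", 3))
# This is a test s
print(text_before_nth("a--b--c--d", "--", 3))
# a--b--c
```

Note that . does not match a newline by default, so multi-line input would need the re.DOTALL flag passed to re.match.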

Wiktor Stribiżew
  • Why did you add a non-capturing group (?:) here? The only use of non-capturing groups I know is keeping the group out of the results used for replacement. – Cam Oct 27 '18 at 14:23
  • @Cam I used it because I don't need to access the text matched with this group. If you only need to group some patterns to quantify their sequence, there is no point using capturing groups as only the last occurrence is saved as the group value (unless it is .NET or the Python PyPI regex module). See more about [non-capturing groups](https://stackoverflow.com/questions/3512471). – Wiktor Stribiżew Oct 27 '18 at 14:51
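The point about quantified capturing groups can be checked with a short sketch, assuming Python's built-in re module (not the PyPI regex module mentioned in the comment above):

```python
import re

s = "This is a test string with five t's"

capturing = re.match(r'^([^t]*t){2}[^t]*', s)
non_capturing = re.match(r'^(?:[^t]*t){2}[^t]*', s)

# Both variants produce the same overall match.
print(capturing.group(0))      # This is a test s
print(non_capturing.group(0))  # This is a test s

# A quantified capturing group keeps only its last repetition.
print(capturing.groups())      # ('est',)

# The non-capturing variant stores no group at all.
print(non_capturing.groups())  # ()
```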
0

As said by @CertainPerformance, .* matches zero or more characters, and you use its lazy version .*?. The lazy version of a quantifier matches as few characters as possible. Because .*? is allowed to match the empty string, each repetition that starts right before a t matches nothing: the lookahead already succeeds there, so after the first repetition the match never moves forward.

You need to use the + quantifier instead, in order to prevent an empty string match.

Demonstration with Python:

>>> import re
>>> s = "This is a test string with five t's"
>>> r = r'^(.+?(?=t)){3}'
>>> re.match(r, s)
<_sre.SRE_Match object; span=(0, 16), match='This is a test s'>
Right leg