I am trying to learn regex, and I fail to understand the differences between the following patterns:

/apples??/
/.*?[0-9]*/
/.*?[0-9]+/

From the results, I understood that putting ? after anything makes that step optional (or just skips it).

E.g. the first command will only match apple and not apples, and the second command will match 26 in page 26.

I expected that the 3rd command would do the same, but it matched the whole page 26 string.

B Luthra

3 Answers


? means many different things in a regex, depending on context.

When used as a quantifier it means "optional". For example, [0-9]? matches 0 or 1 digits (an optional digit).

However, when applied to another quantifier, it means "non-greedy". Normal quantifiers try to match as many times as possible (giving up matches only if the rest of the regex would fail otherwise). Non-greedy matching inverts this: It first tries to match as few times as possible and only matches more if the rest of the regex would fail otherwise.

For example, a[a-z]*b applied to abcdebafcbad matches abcdebafcb, i.e. (abcdebafcb)ad, using parentheses to mark the match. [a-z]* first consumes the rest of the string, then gives back characters one at a time until the following b can match.

However, a[a-z]*?b applied to abcdebafcbad matches ab, i.e. (ab)cdebafcbad, using parentheses to mark the match. [a-z]*? starts out by matching no characters at all, and because b is able to match immediately, that's where the regex stops.
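The greedy/non-greedy difference is easy to check yourself. Here is a minimal sketch in Python, assuming its re module (a backtracking engine that treats these two patterns the same way); the variable names are only for illustration.

```python
import re

text = "abcdebafcbad"

# Greedy: [a-z]* grabs everything it can, then gives characters back
# until the following b is able to match.
greedy = re.search(r"a[a-z]*b", text)

# Non-greedy: [a-z]*? starts with zero characters and grows only when forced to.
lazy = re.search(r"a[a-z]*?b", text)

print(greedy.group())  # abcdebafcb
print(lazy.group())    # ab
```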

As for your examples:

The first thing you need to understand about regexes is the outer loop. There is a loop that tries to invoke the regex engine at every position in the input string, from left to right. If a match is found, the loop stops and reports success. If all positions are exhausted without finding a match, the loop stops and reports failure.

(Strictly speaking we're not looping over the characters in the input string, but over the gaps between characters. ab has 3 possible match positions: before a, between a and b, and after b.)
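To make the outer loop concrete, here is a rough sketch of it in Python. The helper name search_with_outer_loop is made up for this illustration, and a real engine is free to implement the loop differently (and to take shortcuts); the sketch only relies on the fact that a compiled pattern's match(text, pos) succeeds only if the match starts exactly at pos.

```python
import re

def search_with_outer_loop(pattern, text):
    """Hypothetical helper: try an anchored match at each position, left to right."""
    compiled = re.compile(pattern)
    # len(text) + 1 positions: the gaps between characters, including both ends.
    for pos in range(len(text) + 1):
        m = compiled.match(text, pos)  # only succeeds if the match starts at pos
        if m:
            return m  # first success wins; later positions are never tried
    return None  # every position was exhausted without a match

m = search_with_outer_loop(r"[0-9]+", "page 26")
print(m.span(), m.group())  # (5, 7) 26
```

Positions 0 through 4 fail because no digit starts there; position 5 is the first one where [0-9]+ can match.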

Your first regex, apples??, is equivalent to apple(?:|s). We match apple followed by either an empty string or s (i.e. we try to match nothing first). Since the empty regex always matches and this is the end of the regex, there's nothing that could come later on and force us to revisit this decision. The regex is equivalent to just apple.
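A quick way to see this, sketched in Python under the assumption that its re module behaves like the engine discussed here: on its own, s?? never needs to match the s, but adding something after it (here $, which is not part of your original pattern) can force the engine to revisit that decision.

```python
import re

# Nothing follows s??, so the lazy "match nothing" choice is never revisited.
unanchored = re.search(r"apples??", "apples")

# With $ after it, matching nothing fails (we are not at the end of the string),
# so the engine backtracks and lets s?? consume the s.
anchored = re.search(r"apples??$", "apples")

print(unanchored.group())  # apple
print(anchored.group())    # apples
```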

Your second regex, .*?[0-9]*, is a bit curious. The first thing you should notice is that all parts of it are optional, i.e. it can match a string of length 0. Because an empty string can be matched anywhere in the input and the outer loop mentioned above starts at offset 0, this regex will always match at the beginning of the input string.

The second thing to notice is that it attempts to reimplement the outer loop within the regex: .*? will consume 0 characters at first, then (if the rest of the regex doesn't match) 1 character, then 2, ... until a match is found. This is a bit pointless because the outer loop already does exactly that.

If the input string is page 26, .*? will start out by matching as few characters as possible, i.e. none. Then [0-9]* will try to match as many digits as possible, but p is not a digit, so "as many as possible" is also none. Thus .*?[0-9]* matches the empty string at the beginning of page 26: ()page 26 (using parentheses to mark the match).
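You can confirm the empty match by looking at the span rather than at the matched text. This is a small Python sketch; the second input, 26 pages, is an extra example that is not from the question, included to show that the digits are only picked up when they sit at position 0.

```python
import re

m = re.search(r".*?[0-9]*", "page 26")
print(m.span())         # (0, 0)
print(repr(m.group()))  # ''

# Assumed extra input: here a digit is available right at position 0,
# so the greedy [0-9]* takes both digits.
m2 = re.search(r".*?[0-9]*", "26 pages")
print(m2.span(), m2.group())  # (0, 2) 26
```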

Your third regex, .*?[0-9]+, still contains that redundant explicit loop at the beginning, .*?. But now the digits part is not optional: [0-9]+ requires at least one digit to match.

If the input string is page 26, .*? will start out by matching as few characters as possible, i.e. none. Then [0-9]+ will try to match as many digits as possible, but at least one. This fails because p is not a digit. Because [0-9]+ failed, we backtrack into .*? and try to consume one more character (p). Then we try [0-9]+ against the remaining input string, age 26. This also fails; we backtrack and consume one more character in .*? (pa). Then we try [0-9]+ against ge 26, which still fails. ...

This continues until .*? has consumed "page " (including the trailing space). At this point [0-9]+ finally finds a digit to match, 2. Because + is greedy, we consume all available digits at this position, 26. The final match is (page 26) (i.e. the whole input string), with .*? matching "page " and [0-9]+ matching 26.
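To see which part of the pattern matched what, you can wrap each part in a capture group. The groups are added here purely for inspection and are not part of your original regex; this is a Python sketch, assuming its re module.

```python
import re

m = re.search(r"(.*?)([0-9]+)", "page 26")

print(repr(m.group(0)))  # 'page 26'  (the whole match)
print(repr(m.group(1)))  # 'page '    (what .*? was forced to consume)
print(repr(m.group(2)))  # '26'       (what the greedy [0-9]+ took)
```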

melpomene
  • In `.*?[0-9]*`, why is this [0-9]* optional? This is not a lazy expression. I thought that when going through the string page 26, when [0-9] returns no value, it should pass control on to .*?, which should match page too. – B Luthra Apr 21 '19 at 10:31
  • @BLuthra `*` means "0 or more", so it's effectively optional because it can match 0 times. On the other hand, `+` means "1 or more". `A+` is equivalent to `AA*`. – melpomene Apr 21 '19 at 10:34
  • It's 0 or more, but as it's greedy, won't it prefer the longest string over the empty string? – B Luthra Apr 21 '19 at 11:24
  • @BLuthra Yes, but `p` is not a digit, so the longest possible match at this point is 0. – melpomene Apr 21 '19 at 11:25
  • Re "*This is a bit pointless because the outer loop already does exactly that.*", Indeed. You can imagine every pattern being preceded by an implicit `\G.*?\K`. – ikegami Apr 21 '19 at 11:43
  • @melpomene I am again confused about the usage of lazy expressions, e.g. o.*?a to find the highlighted text in the string foobaraz. The answer came out as ooba. Shouldn't it match nothing, as it will prefer nothing over one or anything? – B Luthra Apr 21 '19 at 16:29
  • @BLuthra `.*?` will start by matching nothing. But then `a` in the regex wants to match an 'a', which fails (the current character is 'o'). So then `.*?` will slowly expand to match more and more until `a` is satisfied, which happens when `.*?` matches `ob`. – melpomene Apr 21 '19 at 16:31
  • Re "*Shouldn't it match nothing*", No, it matches the least possible **to successfully match at the current position**. `o.*?a` can successfully match `ooba` at position 0, and to do so, the `.*?` must match `ob`. – ikegami Apr 21 '19 at 16:37
  • @ikegami As I'm using *, its least is nothing (0 or more), so why didn't it match nothing? – B Luthra Apr 21 '19 at 17:13
  • @B Luthra, It can't match nothing because the first `o` isn't followed by `a`. So the engine then tries having `.*?` match one character. The pattern then still fails to match. So the engine now tries having `.*?` match two characters. Finally, the pattern matches. – ikegami Apr 21 '19 at 17:14
  • @ikegami Then why didn't it match page when using /.*?[0-9]*/, as the least match should be page? – B Luthra Apr 21 '19 at 17:17
  • At position 0, the engine tries having `.*?` match zero characters. So far so good. Still at position 0, the engine matches as many digits as possible. It succeeds in matching zero of them (since there's a `p` at position 0). We have a match! – ikegami Apr 21 '19 at 17:19
  • @ikegami i am trying to use `/.*?[0-9]*/` to match the string page 26 but it only matches 26. In this example the minimum of `.*?` is 0, but in o.*?a its minimum becomes one match. Why? – B Luthra Apr 21 '19 at 17:24
  • Re "*i am trying to use `/.*?[0-9]*/` to match the string page 26 but it only matches 26*", You are mistaken. It matches zero characters at position 0. – ikegami Apr 21 '19 at 17:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192178/discussion-between-b-luthra-and-ikegami). – B Luthra Apr 21 '19 at 17:33

Bash regexes (POSIX extended regular expressions) don't support lazy quantifiers, so *?, +?, and ?? won't behave as lazy quantifiers there.
But capturing with a () group and reading it back through ${BASH_REMATCH[1]}, as in:

$ [[ "apples " =~ (apples?) ]] && echo ${BASH_REMATCH[1]}

apples

$ [[ "page 26" =~ (.*[0-9]+) ]] && echo ${BASH_REMATCH[1]}

page 26

$ [[ "page 26" =~ .*\ ([0-9]+) ]] && echo ${BASH_REMATCH[1]}

26

works

Adriano

[ This complements the existing answers rather than being an answer in and of itself. ]

The following shows the matching process for the examples you provided. (The regex engine may actually take invisible shortcuts to increase performance.)


#0123456
"apples" =~ /apples??/
  1. At position 0, apple matches 5 characters ⇒ Position 5.
    1. At position 5, s?? matches 0 characters* ⇒ Position 5.
      1. Success! Pattern matches 5 characters starting at position 0 ("apple")!

#01234567
"page 26" =~ /.*?[0-9]*/
  1. At position 0, .*? matches 0 characters* ⇒ Position 0.
    1. At position 0, [0-9]* matches 0 characters ⇒ Position 0.
      1. Success! Pattern matches 0 characters starting at position 0 ("")!

#01234567
"page 26" =~ /.*?[0-9]+/
  1. At position 0, .*? matches 0 characters* ⇒ Position 0.
    1. At position 0, [0-9]+ fails to match ⇒ Backtrack!
  2. At position 0, .*? matches 1 character ⇒ Position 1.
    1. At position 1, [0-9]+ fails to match ⇒ Backtrack!
  3. At position 0, .*? matches 2 characters ⇒ Position 2.
    1. At position 2, [0-9]+ fails to match ⇒ Backtrack!
  4. At position 0, .*? matches 3 characters ⇒ Position 3.
    1. At position 3, [0-9]+ fails to match ⇒ Backtrack!
  5. At position 0, .*? matches 4 characters ⇒ Position 4.
    1. At position 4, [0-9]+ fails to match ⇒ Backtrack!
  6. At position 0, .*? matches 5 characters ⇒ Position 5.
    1. At position 5, [0-9]+ matches 2 characters ⇒ Position 7.
      1. Success! Pattern matches 7 characters starting at position 0 ("page 26")!

#01234
"ooba" =~ /o.*?a/
  1. At position 0, o matches 1 character ⇒ Position 1.
    1. At position 1, .*? matches 0 characters* ⇒ Position 1.
      1. At position 1, a fails to match ⇒ Backtrack!
    2. At position 1, .*? matches 1 character ⇒ Position 2.
      1. At position 2, a fails to match ⇒ Backtrack!
    3. At position 1, .*? matches 2 characters ⇒ Position 3.
      1. At position 3, a matches 1 character ⇒ Position 4.
        1. Success! Pattern matches 4 characters starting at position 0 ("ooba")!

* — Non-greedy because of the ?, so start by trying to match the least possible at the current position.
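The third trace can be imitated with a few lines of Python. This is only a toy sketch: trace_lazy_prefix is a made-up helper that mimics a leading .*? anchored at position 0 by growing the prefix one character at a time and retrying the rest of the pattern; it is not how a real engine is implemented.

```python
import re

def trace_lazy_prefix(tail_pattern, text):
    """Hypothetical helper imitating .*? followed by tail_pattern at position 0."""
    tail = re.compile(tail_pattern)
    steps = []
    for n in range(len(text) + 1):
        if n > 0 and text[n - 1] == "\n":
            break  # . does not match a newline by default, so .*? cannot grow past it
        m = tail.match(text, n)  # try the rest of the pattern right after the prefix
        if m:
            steps.append(f".*? matches {n}, tail matches {m.group()!r}: success")
            return steps, text[:m.end()]
        steps.append(f".*? matches {n}, tail fails: backtrack")
    return steps, None

steps, matched = trace_lazy_prefix(r"[0-9]+", "page 26")
for step in steps:
    print(step)
print(repr(matched))  # 'page 26'
```

The six printed steps line up with the six numbered steps of the "page 26" =~ /.*?[0-9]+/ trace above.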

ikegami