
Anyone know why these two regexes give different results when trying to match either '//' or '$'? (Python 3.6.4)

  • (a)(//|$) : Matches both 'a' and 'a//'
  • (a)(//)|($) : Matches 'a//' but not 'a'
    >>> import re
    >>> at = re.compile('(a)(//|$)')
    >>> m = at.match('a')
    >>> m
    <_sre.SRE_Match object; span=(0, 1), match='a'>
    >>> m = at.match('a//')
    >>> m
    <_sre.SRE_Match object; span=(0, 3), match='a//'>
    >>> 

vs

    >>> at = re.compile('(a)(//)|($)')
    >>> m = at.match('a//')
    >>> m
    <_sre.SRE_Match object; span=(0, 3), match='a//'>
    >>> m = at.match('a')
    >>> m
    >>> type(m)
    <class 'NoneType'>
    >>>
Jo-el

2 Answers


The pipe (|) has the lowest precedence of the regex operators, so the engine treats everything on each side of a pipe, up to the enclosing group, as a single alternative. In the first case

  • (a)(//|$)

    implies it'll match a string that must have an a before either // or $ (i.e. the end of the string)

    Hence, the first alternative in this case is // and the second alternative is $; both must follow an a

    In this expression, the capturing groups are

    • a
    • Either // or $
  • (a)(//)|($)

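    implies it'll match either a// or $ on its own

    Hence, the first alternative in this case is a// and the second alternative is $. Since re.match only tries position 0, the $ alternative can succeed there only if the string is empty (or is just a newline), which is why 'a' does not match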

    In this expression, the capturing groups are

    Either

    • a
    • //

    OR

    • $

In fact, the capture groups don't change the alternation in the second example: a//|$ matches the same strings, since the regex engine evaluates it as (a//)|$ (the parentheses here are just illustrative; they do not represent capture group syntax).

Try it out in a regex tester; it will show you what the alternatives are for each expression.
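If you'd rather check this from Python than from an online tester, here is a minimal sketch that prints the captured groups for both patterns from the question. The names `grouped`, `top_level`, and `show` are just illustrative, not anything from the `re` module.

```python
import re

# Illustrative names; the two patterns are the ones from the question.
grouped = re.compile(r'(a)(//|$)')      # alternation confined to group 2
top_level = re.compile(r'(a)(//)|($)')  # alternation splits the whole pattern

def show(pattern, text):
    """Return the captured groups, or None when match() fails at position 0."""
    m = pattern.match(text)
    return m.groups() if m else None

for text in ('a', 'a//', ''):
    print(repr(text), show(grouped, text), show(top_level, text))
```

For 'a', the second pattern should report no match at all, and for the empty string only its third group (the $ alternative) takes part in the match.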

Chase
  • That makes sense! Thanks for your help :) – Jo-el Feb 27 '20 at 00:32
  • @Jo-el Welcome to Stack Overflow. If you found what you were looking for, you should click the tick icon next to the answer to mark it as correct. That way, people who come here with the same question see the correct answer at the top. – Chase Feb 27 '20 at 07:43
  • For some reason I didn't see the checkmark last time I checked here - maybe my login expired or something - I had thought that maybe I couldn't 'accept' an answer anymore now that this was dup'ed [with a question that doesn't cover the specifics of this case..]. Thanks again :) – Jo-el Feb 28 '20 at 17:42
  • Actually, followup on this. Is the right way to think about | that it's trying to capture the whole "word" before and after it? For instance `1 2|3` should match `1 2` or `1 3`? So my mistake was thinking that adding groups around the (a) and (//) would break the precedence, when in practice to regex they are still the same "word". Is that correct? – Jo-el Feb 28 '20 at 17:47
  • The `|` has the lowest precedence. When you don't use `(` and `)`, **everything** before it is the *first alternative* and **everything** after it is the *second alternative*. `1 2|3` will match either `1 2` or `3`. More [info](https://www.regular-expressions.info/alternation.html) – Chase Feb 28 '20 at 18:06
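The `1 2|3` case from the comments can be checked the same way. This is a small sketch, with `bare` and `scoped` as made-up names, that uses `fullmatch` so each test string has to be consumed entirely. The non-capturing group `(?:...)` is one way (an assumption about what was intended) to limit the alternation to just `2` or `3`.

```python
import re

bare = re.compile(r'1 2|3')        # alternatives are "1 2" and "3"
scoped = re.compile(r'1 (?:2|3)')  # alternatives are "2" and "3", both after "1 "

for text in ('1 2', '1 3', '3'):
    # Print whether each pattern matches the whole string.
    print(repr(text), bool(bare.fullmatch(text)), bool(scoped.fullmatch(text)))
```

So `1 2|3` accepts `3` but not `1 3`, while the scoped version does the opposite.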

| has low precedence, so (a)(//)|($) means ((a)(//))|($); it will therefore match either ((a)(//)) or ($). To get the same results as the first regex, use (a)((//)|($)), which is the first regex with extra groups added. The first regex is cleaner and should be preferred unless you need the extra capture groups.
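As a quick check, here is a sketch of how the nested version behaves on the two inputs from the question (`nested` is just an illustrative name):

```python
import re

# The pattern suggested above: the alternation is wrapped in group 2,
# with groups 3 and 4 recording which alternative was taken.
nested = re.compile(r'(a)((//)|($))')

for text in ('a', 'a//'):
    m = nested.match(text)
    print(repr(text), m.span(), m.groups())
```

Both inputs match, and whichever of groups 3 and 4 is not None tells you which branch was used.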

See here for more details on precedence - https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08