1

I have some text with lowercase letters, dots, parentheses, and greater than and less than signs. Suppose that I want to match substrings on each line that (1) begin with a period, (2) contain any number of letters, and (3) have a non-negative number of either parentheses or </> signs, but not both. Therefore, given this text,

foobar.hello(world)
foobar.hello<world>
foobar.hello>>>world<>(baz)

I want to match .hello(world) on the first line, .hello<world> on the second line, and .hello>>>world<> on the third (since I can't mix parentheses and </> signs).

I could use two regular expressions to match my desired strings, \.[a-z()]+ and \.[a-z<>]+. However, because regexes are more efficient when similar patterns are combined, I tried to combine them into a single regex with a logical OR |:

\.(?:[a-z()]+|[a-z<>]+)

After trying this online, while the regex matched my desired substring for the first line, for the second and third lines, it only matched .hello. Yet when I switch the order of the elements, the opposite happens—the first line gets matched as .hello, and the second and third lines are matched as desired. This comes to me as a surprise, since I wouldn't think order would matter with an OR operator. What's happening here?

gz839918
  • 111
  • 5
  • 1
    In fact, order does matter, even in programming languages. This behaviour is called [short-circuitry](https://stackoverflow.com/a/56057305). – InSync Jul 29 '23 at 10:02

2 Answers2

2

Your problem is that [a-z()]+ does not require brackets in the input, so it matches hello. Further, alternations match left-to-right, so .hello is successfully matched and the engine stops there when the input has angled brackets.

To fix the problem, require brackets (that agree with each other) via an alternation for the brackets and their contents, placed after the initial part, so:

\.[a-z]+(?:\([a-z]+\)|<[a-z]+>)

See live demo.

This approach has the (possible) advantage of being more strict; your regex would match everything from dot onwards of

foobar.()()()(((((()foo

which is probably not wanted (although the input may not have syntactically invalid content).

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • To be honest, I actually intended for the regex to match monstrosities with unpaired delimiters like `foobar.()()()(((((()foo`, at least in theory, anyways; my real use cases aren’t so horrendous. I think my example gave the wrong impression, and I’ve edited my question accordingly. I’m more interested in an explanation than a fix, so this was helpful anyways (+1) – gz839918 Jul 29 '23 at 16:28
0

The regex OR operator terminates after the first successful match. This is the short circuiting behavior described by InSync's comment. To make this more obvious, if our regular expression is a+|a+b+|a+b+c+, then no matter the string, our regex can only return matches containing the letter a:

String                         Matches
--------------------           --------------------
'a'                            ['a']
'aaa'                          ['aaa']
'aaabbb'                       ['aaa']
'aaabbbccc'                    ['aaa']
'yabba dabba doo'              ['a', 'a', 'a', 'a']

Therefore, the regex that solves the original problem is

\.[a-z]*(?:[a-z()]+|[a-z<>]+)

This correctly matches .hello(world), .hello<world>, and .hello>>>world<> in the three test cases in the original question.

When I asked the question, I had naively believed that because regular expressions are greedy, they must return the longest possible match from any string. This is not so. The greedy property of a regular expression only means that a quantifier such as + or * will continue to match, so long as more characters are available. However, this is not the same thing as a regex backtracking to find all valid matches, then comparing their relative lengths. In terms of finite-state automata, if it were possible to count and compare lengths, then the anbn language would be regular, which is a contradiction. Moreover, although regular languages are closed under a finite number of union operations, I was essentially trying to "glue" together individual automata to get the automaton of the union, like gluing together the a+, a+|b+, and a+b+c+ languages to obtain a+|a+b+|a+b+c+, even though the correct regex for the union of these languages is a+(b+(c+)?)?.

gz839918
  • 111
  • 5