4

I tried this regex:

ab(cd|c)*d

in the regex101 and RegExr websites. It matched this text completely:

abcdcdd

Now let's swap "cd" and "c" in the regex:

ab(c|cd)*d

When I try this regex in the websites, I see this regex does not completely match the same text.

Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?


REGEX: ab(cd|c)*d

Complete text matched in 13 steps: abcdcdd


REGEX: ab(c|cd)*d

Partial text matched in 9 steps: abcdcdd

joanis
  • 10,635
  • 14
  • 30
  • 40
tiki
  • 57
  • 2
  • "why regex engine can't to recognize that `ab(cd|c)*d` and `ab(c|cd)*d` are same". Well, because as you have observed, they are _not_ the same... – Sweeper Aug 31 '19 at 12:21
  • If the `d` is optional, you could omit the pipe and make the d optional. `ab(cd?)*d`. Note that it repeats the capturing group. https://regex101.com/r/frwMkI/1 – The fourth bird Aug 31 '19 at 12:47
  • 1
    There are a lot of very technical answers here, but the simple answer is that Regex matches `|` (or statements) with preference to the left-most pattern, it will never try the second pattern if the first matches and nothing causes the the match head to backtrack. In your second example `c` matches from `c|cd`, so we exit the "or" part, and then match `d`, leaving `cdd` unmatched. – Mako212 Sep 01 '19 at 00:47
  • **Duplicate of [Why order matters in this RegEx with alternation](https://stackoverflow.com/questions/10248776/why-order-matters-in-this-regex-with-alternation)** – Wiktor Stribiżew Sep 04 '19 at 14:16
  • See also [Use the right regex flavor!](https://stackoverflow.com/a/36296918/3216427) for examples of regular expression engines that will match the longest sequence. – joanis Sep 04 '19 at 14:36

4 Answers4

5

@MurrayW's answer is excellent, but I would like to add some background information.

Regex as Finite State Automata

When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:

Both

echo abcdcdd | grep --color -E 'ab(c|cd)*d'

and

echo abcdcdd | grep --color -E 'ab(cd|c)*d'

color the whole string in red.

Patterns we call "regular expressions"

True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups, of to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember that they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.

At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.

So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.

Speed

While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.

joanis
  • 10,635
  • 14
  • 30
  • 40
  • Usually the construction algorithm first goes from a regular expression to a non-deterministic finite state automata and then to a finite state automata. Those regular expression "engines" that instead emulate NFSAs should be doing backtracking and therefore (cd|c) and (c|cd) should be equivalent. I confess that I was surprised by the OP's post. I am not sure what you meant when you say that "the patterns we use are not regular at all" unless you mean to imply we do not get the expected results. I consider (cd|c) not being equivalent to (c|cd) a bug. – Booboo Aug 31 '19 at 21:58
  • @RonaldAaronson I agree it should be considered a bug if the expression truly was regular. But our misnamed patterns in most "regular" expression libraries are not actually intended to be regular, we just stick with that name for historical reasons. – joanis Aug 31 '19 at 22:03
  • I am trying to understand what is not *regular* about `ab(cd|c)*d`. What about `ab(cd?)*d`, which in my mind is also equivalent? – Booboo Aug 31 '19 at 22:12
  • You misunderstand me and the question: OP and I are *not* talking about regular expressions, we're talking about regex101.com, the `//` operator in Perl, the "regular expression" libraries of various languages. These things implement libraries that support expressions that sure look like regular expressions, but are not actually processed as regular expressions. All these libraries are misnamed, and don't respect the strict rules of regular expressions. – joanis Aug 31 '19 at 22:19
  • @RonaldAaronson Just adding one more thought: you pointed out that the two patterns look just like regular expressions. Indeed, I think that is probably why all our libraries are called regular expression libraries, even if the formalism applied to them is different. – joanis Sep 01 '19 at 12:49
  • Now I believe we are on the same page. – Booboo Sep 01 '19 at 12:54
3

Because the | character performs an or operation by testing the left-most condition first. If that matches, nothing further is tested in the or. If that fails, then the next or element is tested, and so on.

Using regex pattern ab(cd|c)*d, you can see that the cd part of (cd|c)* matches in your string, and is also repeated: abcdcdd.

However, in pattern ab(c|cd)*d, the c matches from the or operation in abcdcdd and so cd isn't tested at all. Then, the d at the end of the pattern matches the d after the first c and then the pattern stops, having only matched abcdcdd

MurrayW
  • 393
  • 2
  • 10
1

As previously answered in the comments, they are not the same patterns. The alternation in the first one tries to match cd first, the second one c first.

First pattern

abcdcdd
  ^^^^
   ||
   ||
ab(cd|c)*d

Second pattern

 abcdcdd
   ^^____
   |     |
   |     |
ab(c|cd)*d

If the d is optional, you can omit the pipe for the alternation and make the d optional.

ab(cd?)*d.

Regex demo

Note that this way you repeat the capturing group which will hold the value of the last iteration.

If you are not interrested in the value of the group and non capturing groups are supported you could use ab(?:cd?)*d.

The fourth bird
  • 154,723
  • 16
  • 55
  • 70
0

Regex is always a left to right proposition.
The only way a regex engine will ignore a previous alternation construct
is if it has to satisfy a term on the right side of the alternation group
that cannot be satisfied otherwise.

The regex rule is that the pattern is traversed from left to right,
but is controlled by the target string being traversed from left to right.

The symbiosis ..

Given the target string was matched like so "abcdcdd"
its easy to assume that the regex subset of the full regex

 ab
 ( c | cd )*                   # (1)
 d

is clearly

 ab
 c* 
 d

where the cd term of the alternation to the right was never needed
for a successful match.

This proves regex engines are a Left to Right bias machine.

  • I'm afraid I have to disagree: a regex properly compiled to a finite state automaton, such as `grep` does, is not a left to right proposition. Rather it will always accept the longest possible substring it can accept. But our regex libraries don't use finite state automata, and your answer is correct about our libraries. (I'm only objecting to the word "always" - yes, they're left to right in some contexts, just not always.) – joanis Sep 01 '19 at 12:53
  • @joanis - Always matching the longest match is worse than matching the first match, don't you agree ? Imagine 4000 alternations, it's just crap to design such a thing as longest match. I am going to repeat myself `Regex is always a left to right proposition` a first come, first serve proposition, _ALWAYS_ ! –  Sep 01 '19 at 19:23
  • @joanis - I'll leave that formal language stuff to you, although I did get an _A_ in _Formal Symbolic Logic_, an elective I took in school. I do bring to the table a certain level of expertise in regex you don't find on this site. –  Sep 01 '19 at 21:39
  • @joanis - I'm more user than theorist. Although only because I'm dyslexic. I did however manage to modify / fix the boost regex engine though and learnt some stuff –  Sep 01 '19 at 21:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198802/discussion-between-joanis-and-sln). – joanis Sep 01 '19 at 21:48