7

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?, *, and +, and the nongreedy quantifiers ??, *?, and +?. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?

(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa would match the strings a and aa if the alternation is unordered, but only a if the alternation is ordered.)

Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?

If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).

uckelman
  • 25,298
  • 8
  • 64
  • 82
  • I don't see how ordered alternation could be equivalently written with help of unordered alternations and such limited regex. (But that alternation is unordered could be part of the specification, in many cases it wouldn't matter.) – Qtax Jul 20 '11 at 20:16

2 Answers2

1

in http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" (you need to read the entire series to understand, i suspect, but the "DFA" in question is expanded from the NFA graph "on the fly") to handle ordered alternations. while this is only an appeal to authority, and not a proof, i think it's fair to say that if russ cox can't do it (express ordered alternations as a pure DFA), then no-one knows how to.

andrew cooke
  • 45,717
  • 10
  • 93
  • 143
-1

I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:

  1. Let's say we have the regex x||y where x and y are regexen and || means the unordered alternation. If so we can construct DFA's accepting x and y. We will mark those DFA_x and DFA_y
  2. We will construct the DFA for x||y in phases by connecting DFA_x and DFA_y
  3. For every path in DFA_x corresponding to some string a (by path I mean a path in the graph sense without traversing and edge twice so a is a path in DFA_"a*" but aa is not)...
    • For every symbol in the alphabet s
      • If DFA_y consumes as (that is if run on as DFA_y will not stop early but it may not necessarily accept) and DFA_x does not and DFA_x doesn't accept any prefix of as create a transition from the state DFA_x ends in after consuming a to the state DFA_y ends in after consuming as
  4. The accepting states of the final DFA are all the accepting states of both the input DFA's. The starting state is the starting state of DFA_x.

Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.

Paweł Obrok
  • 22,568
  • 8
  • 74
  • 70
  • How could a DFA not consume a character? The very definition of DFAs includes their completeness! – Nowhere man Sep 17 '11 at 18:21
  • That may not be as clear as I at first thought. What I meant by that is that we don't know up until the last character of this string if it will be accepted or not. For example a DFA accepting `(aa)+` will "consume" in this sense any string of the form `a+`, it will not, however, consume `ab`. – Paweł Obrok Sep 17 '11 at 18:41
  • Yes it will, that's the very definition of a DFA! The character 'b' will lead the DFA to its error state, but if 'b' is not consumed, then it is not, by definition, a DFA. – Nowhere man Mar 16 '12 at 23:59
  • OK - then replace "consume" with "consume without visiting the error state". – Paweł Obrok Mar 20 '12 at 13:28