12

If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?

That is, the list is valid if and only if for all strings a maximum of one item in the list will match the entire string.

It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.

The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.

captncraig
  • 22,118
  • 17
  • 108
  • 151
  • possible duplicate of [How can you detect if two regular expressions overlap in the strings they can match?](http://stackoverflow.com/questions/1849447/how-can-you-detect-if-two-regular-expressions-overlap-in-the-strings-they-can-mat) – ire_and_curses Jun 03 '10 at 17:01
  • I guess I misunderstood. You mean that two given regular expressions must be fully mutually exclusive to *any* input string? I.e., that of 2^32 of possible four-byte strings, a regex may only match one possibility? Isn't that the same as saying: match this exact string? – Abel Jun 03 '10 at 17:23
  • I mean the intersection of the regexes must be zero. No string matches more than 1 regex. – captncraig Jun 03 '10 at 17:25
  • Furthermore I should note that I am talking about valid c# regular expressions. – captncraig Jun 03 '10 at 17:26
  • Then, I'm afraid that my original answer still holds (and the last paragraph of Jim's answer). You cannot do that, simply because of the very nature of these C# (which are NFA) regular expressions. (PS: i deleted mine, as it was too crappy. Go for Jim's) – Abel Jun 03 '10 at 17:32

3 Answers3

7

If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible. What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine them into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, that string is accepted by both input regexen.

The problem with this is that the first step of the usual regex->DFA algorithm is to convert the regex to a NFA, then convert the NFA to a DFA. But that last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for very simple regular expressions.

If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.

Jim Lewis
  • 43,505
  • 7
  • 82
  • 96
  • Intriguing thought. I think you're effectively crossing swords with Jeffrey Friedl, who said (page 157) "Talking about DFA matching is very boring". You just made it interesting again (accept that DFA still limits you highly)! – Abel Jun 03 '10 at 17:17
  • Thats what I feared. Very interesting answer though. – captncraig Jun 03 '10 at 17:52
1

The Wkipedia article on regular expressions does state

It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).

but gives no further hints.

Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.

High Performance Mark
  • 77,191
  • 7
  • 105
  • 161
  • 1
    I believe CaptnCraig wants to know if the languages have a non-empty intersection, not whether they're identical. – Jim Lewis Jun 03 '10 at 17:15
0

You can't do that by only looking at the regular expression.

Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to string "11" they produce different results.

The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.

Seth
  • 45,033
  • 10
  • 85
  • 120
  • *"When applied to string "11" they produce different results. "* actually: they don't, they produce the same result. Unless you add anchoring. – Abel Jun 03 '10 at 17:00
  • For pure regular expressions, what CaptnCraig asks is possible (but may be inefficient). He wants to know if two regular languages (specified by regular expressions) have a non-empty intersection. For your example the answer is "yes". – Jim Lewis Jun 03 '10 at 17:02
  • @Abel: I think what he meant is that the part of the string they match is different. They'll both match though. – Matti Virkkunen Jun 03 '10 at 17:02
  • Sorry, my question was bad. Maybe what I meant to ask is that only one matches at position 0. – captncraig Jun 03 '10 at 17:06
  • @Jim: how do you find that it is possible to find the intersection of two infinite sets? Can you elaborate? @Matti: indeed, they match different parts (unless you have a non-greedy engine, but that's rare, but it happens) :) – Abel Jun 03 '10 at 17:08
  • @CaptnCraig: perhaps you can update your q. with examples? Is "position 0" the same as the first character? Or the first string? – Abel Jun 03 '10 at 17:10
  • @Abel: Convert each regex to a DFA, combine the DFAs to produce a DFA recognizing the intersection, then check if the intersection is empty. Regular languages are closed under intersection, and this can be proved in finite time even if the languages are infinite. – Jim Lewis Jun 03 '10 at 17:12
  • I was under the impression that the regexes were allowed to overlap, but that given the input strings, they were only allowed to match one of them, and only one of them. While you prove (non)intersection, intersection might be allowed if the rule is based on the input strings. I.e., `k*` and `j*` are good if input is `cat, khaki, joe`, while the regexps do intersect. – Abel Jun 03 '10 at 17:29