Find simplest regular expression matching all given strings

Question

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?

It is probably unrealistic to find such an algorithm for regular-expression grammars with very "complicated" syntax (arbitrary repetitions, assertions, etc.), so let's start with a simplified one that only allows an OR of substrings:

foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.

Examples

Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).

Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct because it expands to 4 strings, including h_p2_b, which was not in the original set of strings.
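To check a candidate expression against the requirement above (it must expand to exactly the input set, not a superset), a small expander for the simplified grammar is enough. This is only a sketch: the function name `expand` is made up for illustration, and it assumes a well-formed pattern built from literal characters, `|` and parentheses, with no escapes or quantifiers.

```python
from itertools import product


def expand(pattern: str) -> set[str]:
    """Return every string matched by a pattern made of literals, '|' and '(...)'."""
    pos = 0  # current read position in the pattern

    def parse_alternation() -> set[str]:
        nonlocal pos
        result = parse_sequence()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1  # skip '|'
            result |= parse_sequence()
        return result

    def parse_sequence() -> set[str]:
        nonlocal pos
        result = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            if pattern[pos] == "(":
                pos += 1  # skip '('
                group = parse_alternation()
                pos += 1  # skip ')' (assumes the pattern is well-formed)
            else:
                group = {pattern[pos]}
                pos += 1
            # Concatenate every string so far with every option of this item.
            result = {a + b for a, b in product(result, group)}
        return result

    return parse_alternation()


wanted = {"h_q1_a", "h_q1_b", "h_p2_a"}
print(expand("h_(q1_(a|b)|p2_a)") == wanted)   # True
print(expand("h_(q1|p2)_(a|b)") - wanted)      # {'h_p2_b'}
```

The second call shows the over-generation described in the second example: the extra string is h_p2_b.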

Use case

I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.

(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)

"such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings" -> I mean exactly the set of initial strings (i.e. not a superset). — fuenfundachtzig, Apr 04 '13 at 09:33
Given that you have a finite set of strings, you'll never use *, so that leaves concatenation and choice. You could try working with a prefix tree, which is what the algorithm sowrov links to will build for you. — G. Bach, Apr 04 '13 at 13:15
Emacs Lisp has such a function called `regexp-opt`. See [the source](http://bzr.savannah.gnu.org/lh/emacs/trunk/annotate/head:/lisp/emacs-lisp/regexp-opt.el) and [a description](http://irreal.org/blog/?p=614). The code is well commented, so you might be able to port some of it. — legoscia, Apr 04 '13 at 15:24
+1 for great question, but you mean `h_(?:q1|p2)_[a-c]`, right? :) — zx81, Jun 07 '14 at 11:47
Are you sure `?:` is needed? But you're right that `[a-c]` is shorter than `(a|b|c)`. Usually, however, the substrings would be longer than one character. — fuenfundachtzig, Jun 10 '14 at 06:40

rici · Answer 1 · 2013-04-04T17:50:46.513

There is a straightforward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only the concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)

The procedure is simple, if you have access to standard FSM algorithms:

1. Make a non-deterministic finite-state automaton (NFA) by adding every string as a sequence of states, with each sequence starting from the start state. This is clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.
2. Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst case of the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.
3. Minimize the DFA.
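The three steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the Hopcroft or Revuz algorithm: it builds the prefix tree directly (skipping the explicit NFA), and it minimizes by merging nodes bottom-up whenever they have the same finality and the same transitions to already-merged states, which is sufficient here only because the automaton is a tree. All function names are invented for the example.

```python
def build_trie(strings):
    """Steps 1-2: prefix tree with nodes {'final': bool, 'next': {char: node}}."""
    root = {"final": False, "next": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["next"].values())


def minimize(root):
    """Step 3: merge equivalent states bottom-up.

    Returns (start_id, table) with table[state_id] = (is_final, {char: state_id}).
    """
    registry = {}  # signature -> state id
    table = {}     # state id -> (is_final, transitions)

    def visit(node):
        transitions = {ch: visit(child) for ch, child in node["next"].items()}
        signature = (node["final"], tuple(sorted(transitions.items())))
        if signature not in registry:
            registry[signature] = len(registry)
            table[registry[signature]] = (node["final"], transitions)
        return registry[signature]

    return visit(root), table


def accepts(start, table, s):
    """Run the minimized DFA on s."""
    state = start
    for ch in s:
        state = table[state][1].get(ch)
        if state is None:  # no transition: reject
            return False
    return table[state][0]


labels = ["h_q1_a", "h_q1_b", "h_q1_c", "h_p2_a", "h_p2_b", "h_p2_c"]
trie = build_trie(labels)
start, table = minimize(trie)
print(count_nodes(trie), len(table))  # 15 8
```

For the six labels of the first example the prefix tree has 15 nodes and the merged automaton has 8 states; the two `_(a|b|c)` tails end up sharing the same states. The recursion depth grows with the label length, so very long strings would need an iterative traversal.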

DFA minimization is a well-studied problem. The Hopcroft algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.

For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.

Don't we lose important information in step 2 which we would need to avoid failure as exemplified in my 2nd example? — fuenfundachtzig, Apr 04 '13 at 21:08
@fuenfundachtzig: No, the conversion from NFA to DFA is precise. Exactly the same set of strings are matched. The minimization is also precise. Step 2 is not where we recombine paths; after step 2 we still have a tree. It's step 3 which makes it into a DAG — rici, Apr 04 '13 at 21:27
@fuenfundachtzig: by the way, you claim that your second example has solution `h_(q1_a|q1_b|p2_a)`. Why is it not `h_(q1_(a|b)|p2_a)`? That's one character shorter :) and more importantly, the corresponding DFA is minimized, having three fewer states. — rici, Apr 05 '13 at 06:22
The solution is not unique (but yours is better indeed!) because I didn't explicitly require the expression to be the shortest possible. But of course to avoid trivial solutions like A|B|C|D|... one should require the solution to be as short as possible (or at least "short" :) — fuenfundachtzig, Apr 05 '13 at 06:51
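As a rough illustration of the last step (turning the loop-free automaton into an expression with only concatenation and `|`), the sketch below works on the prefix tree and pulls a shared tail out of a group of alternatives whenever their text ends with the same balanced sub-expression. That textual factoring is a heuristic stand-in for the merged states of the minimized DFA, not a guaranteed-shortest construction. It assumes the labels contain no regex metacharacters (no escaping is done), and all function names are hypothetical.

```python
import re


def build_trie(strings):
    """Prefix tree with nodes {'final': bool, 'next': {char: node}}."""
    root = {"final": False, "next": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def is_balanced(text):
    """True if every ')' in text closes a '(' that is also in text."""
    depth = 0
    for ch in text:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0


def common_suffix(parts):
    """Longest shared textual suffix of parts that is a balanced sub-expression."""
    shortest = min(parts, key=len)
    n = 0
    while n < len(shortest) and all(p[-1 - n] == shortest[-1 - n] for p in parts):
        n += 1
    # Shrink until the suffix no longer cuts through a parenthesized group.
    while n > 0 and not is_balanced(shortest[len(shortest) - n:]):
        n -= 1
    return shortest[len(shortest) - n:]


def to_regex(node):
    """Expression for everything reachable from node (never a bare top-level '|')."""
    alternatives = [ch + to_regex(child) for ch, child in node["next"].items()]
    if node["final"] and alternatives:
        alternatives.append("")  # a label ends here while longer ones continue
    if len(alternatives) <= 1:
        return "".join(alternatives)
    suffix = common_suffix(alternatives)
    heads = [a[: len(a) - len(suffix)] for a in alternatives]
    return "(" + "|".join(heads) + ")" + suffix


labels = ["h_q1_a", "h_q1_b", "h_p2_a"]
pattern = to_regex(build_trie(labels))
print(pattern)                                         # h_(q1_(a|b)|p2_a)
print(all(re.fullmatch(pattern, s) for s in labels))   # True
```

On the six labels of the first example the same function returns h_(q1|p2)_(a|b|c). When one label is a prefix of another, the output contains an empty alternative such as `ab(c|)`, which Python's `re` accepts.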

score 3 · Answer 2 · edited Apr 04 '13 at 12:05

You can try to use the Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be fairly easy to generate the simplified regex. Taking your input strings as an example:

h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c

will generate a finite-state machine that most probably looks like this:

      [h_]         <-level 0
     /   \
  [q1]  [p2]       <-level 1
     \   /
      [_]          <-level 2
      /\  \
     /  \  \
    a    b  c      <-level 3

Now for every level/depth of the trie, all the strings (if there are several) go inside OR brackets, so

h_(q1|p2)_(a|b|c)
L0   L1  L2  L3

answered Apr 04 '13 at 10:55 by sowrov · edited Apr 04 '13 at 12:05 by Rafał Dowgird

Assuming the implementation [here](http://blog.ivank.net/aho-corasick-algorithm-in-as3.html) is correct this doesn't work because the algorithm doesn't converge to one node at level 2. (Which is the problem with all prefix trees I know.) – fuenfundachtzig Apr 04 '13 at 13:32
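The level-by-level grouping described in this answer can be tried against the question's second example. The sketch below is an illustration under assumptions: it treats `_` as a token separator, assumes every label has the same number of tokens, and the function name `levelwise_pattern` is made up. It ORs together the tokens seen at each position and also enumerates what the resulting pattern would generate.

```python
from itertools import product


def levelwise_pattern(labels, sep="_"):
    """OR together the tokens seen at each position; return (pattern, generated set)."""
    columns = []  # columns[i] = distinct tokens at position i, in first-seen order
    for label in labels:
        for i, token in enumerate(label.split(sep)):
            if i == len(columns):
                columns.append([])
            if token not in columns[i]:
                columns[i].append(token)
    pattern = sep.join(
        c[0] if len(c) == 1 else "(" + "|".join(c) + ")" for c in columns
    )
    # Every combination of one token per position is matched by the pattern.
    generated = {sep.join(combo) for combo in product(*columns)}
    return pattern, generated


labels = ["h_q1_a", "h_q1_b", "h_p2_a"]
pattern, generated = levelwise_pattern(labels)
print(pattern)                   # h_(q1|p2)_(a|b)
print(generated - set(labels))   # {'h_p2_b'}
```

For the six labels of the first example the generated set equals the input, so per-level grouping is exact only when the labels form a full cross product of the tokens at each position.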
