Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings, such that the set of all strings matching the regular expression is exactly the initial set of strings?
It is probably unrealistic to find such an algorithm for regular-expression grammars with very "complicated" syntax (arbitrary repetitions, assertions, etc.), so let's start with a simplified one that only allows an OR of substrings:
foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct because it expands to 4 strings, including h_p2_b, which was not in the original set of strings.
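To make the correctness criterion concrete, here is a minimal Python sketch that expands a pattern of the simplified grammar back into its set of strings. The function name expand is my own, and the sketch assumes the only special characters are (, ) and |, with no escaping.

```python
def expand(pattern: str) -> set:
    """Return every string matched by a pattern made of literals, (...) and |."""
    pos = 0  # current read position in `pattern`

    def parse_alternation() -> set:
        # alternation := sequence ('|' sequence)*
        nonlocal pos
        results = parse_sequence()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1  # skip '|'
            results |= parse_sequence()
        return results

    def parse_sequence() -> set:
        # sequence := (literal | '(' alternation ')')*
        nonlocal pos
        results = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            if pattern[pos] == "(":
                pos += 1  # skip '('
                choices = parse_alternation()
                if pos >= len(pattern) or pattern[pos] != ")":
                    raise ValueError("unbalanced parenthesis")
                pos += 1  # skip ')'
            else:
                choices = {pattern[pos]}
                pos += 1
            # Concatenate every string so far with every new choice.
            results = {head + tail for head in results for tail in choices}
        return results

    strings = parse_alternation()
    if pos != len(pattern):
        raise ValueError(f"unexpected character at position {pos}")
    return strings


print(sorted(expand("foo(a|b|cd)bar")))       # ['fooabar', 'foobbar', 'foocdbar']
print(sorted(expand("h_(q1_(a|b)|p2_a)")))    # ['h_p2_a', 'h_q1_a', 'h_q1_b']
print("h_p2_b" in expand("h_(q1|p2)_(a|b)"))  # True, so this pattern is too permissive
```

A candidate output is acceptable exactly when expand(candidate) equals the input set.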
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than in an optimal one. The trivial answer is of course to just concatenate all strings, like A|B|C|D|..., which is, however, not helpful.)
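For reference, here is a sketch of the trivial answer next to one naive improvement that factors out shared prefixes with a character trie. This is only a baseline I would like to beat, not the desired algorithm: it never merges shared suffixes, so it does not reach h_(q1|p2)_(a|b|c) for the first example. The function names are hypothetical, and the sketch assumes a non-empty input whose strings contain none of (, ) and |.

```python
def trivial_pattern(strings):
    """The unhelpful baseline: list every string, separated by '|'."""
    return "|".join(sorted(strings))


def prefix_pattern(strings):
    """Factor shared prefixes of `strings` into nested (x|y) groups."""
    END = ""  # marker key: one of the input strings ends at this node

    # Build a character trie as nested dicts.
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = {}

    def render(node):
        # One branch per outgoing character; END contributes an empty branch.
        branches = []
        for ch in sorted(node):
            if ch == END:
                branches.append("")
            else:
                branches.append(ch + render(node[ch]))
        if len(branches) == 1:
            return branches[0]  # no choice here, so no parentheses needed
        return "(" + "|".join(branches) + ")"

    return render(trie)


labels = ["h_q1_a", "h_q1_b", "h_p2_a"]
print(trivial_pattern(labels))  # h_p2_a|h_q1_a|h_q1_b
print(prefix_pattern(labels))   # h_(p2_a|q1_(a|b))

labels = ["h_q1_a", "h_q1_b", "h_q1_c", "h_p2_a", "h_p2_b", "h_p2_c"]
print(prefix_pattern(labels))   # h_(p2_(a|b|c)|q1_(a|b|c))
```

The second output shows the gap: the common tail _(a|b|c) is repeated under each prefix instead of being shared.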