7

I'm trying to efficiently extract static strings (strings that MUST be matched for a given regular expression to match). I've been able to do it in the simplest cases but I'm trying to discover a more robust solution.


Given a regex such as the one below

"fox jump(ed|ing|s)"

would give us

"fox,jumped,jumping,jumps"

Another example is

"fox jump(ed|ing|s)?"

which would give us

"fox,jump"

because of the optional operator


The algorithm I have is overly simple for now. It will start from the end of the regex and removes groups or a single character followed by these operators "* ?" as well as "explode" grouped OR operators "(|)". This has worked quite well but doesn't take into consideration the full syntax of a regex. You can think of it as kind of a minimal set generating process for a regex (the minimal set of strings that the regex can "generate/must match").

WHY? I'm trying to match a bunch of text against a large set of regexes. If I can get a list of "keywords" for these regexes that is "required" I can do a quick text search for that keyword to filter the regexes I care about (ignore the ones I am guaranteed to not match or even skip that text entirely effectively not running any regexes on the text because we are guarenteed to not have a match within our set of regexes). I can organize this set of keywords in an efficient data structure (Binary Search/Trie/Aho-Corasick) to filter the set of regexes before I even try to run the text through the Finite Automata. There are extremely fast string matching algorithms that I can run as a filtering stage before I attempt to run a regular expression. I've been able to increase throughput many folds doing this simple process.

zer0bit
  • 615
  • 5
  • 13
  • Why do this? Some background might bring out some better ways of achieving what you're trying to do. – Highly Irregular Dec 20 '12 at 21:48
  • 3
    added some background in the WHY? section. thx! – zer0bit Dec 20 '12 at 21:55
  • +1 Sounds like it's well thought out – Highly Irregular Dec 20 '12 at 21:59
  • 1
    [Here](http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python) [are](http://stackoverflow.com/questions/2654947/how-can-i-generate-all-possible-permutations-from-a-perl-regular-expression) [some](http://stackoverflow.com/questions/1248519/how-can-i-expand-a-finite-pattern-into-all-its-possible-matches) [related](http://stackoverflow.com/questions/1667528/regular-expression-listing-all-possibilities) [questions](http://stackoverflow.com/questions/4418544/generate-test-data-from-a-regex). – Andrew Cheong Dec 20 '12 at 22:53
  • First, you need to parse the regex (the full syntax - so that you can identify features of the regex that you don't want to implement), then write a generator from the parse tree. You may want to set an upper limit for those that can match infinite set or large set of input, such as `.`, negated character classes, `*`, `+`, `{n,}`. I mostly do the talking here, so I don't know whether this is actually feasible, but it is worth some try. – nhahtdh Dec 21 '12 at 03:52
  • possible duplicate of [Telling if regular expression contains a single invariable segment](http://stackoverflow.com/questions/9300683/telling-if-regular-expression-contains-a-single-invariable-segment) – porges Oct 07 '13 at 06:13

2 Answers2

0

If I understand your problem correctly, you are looking for a set of words such that all these words are (disjoint) substrings of any word accepted by the (given) regular expression.

My guess is that such a set will very often be empty, but nevertheless it can be found.

To find such a set, I propose the following algorithm:

  1. Find the FA corresponding to your input regex.
  2. Identify bridges ( https://en.wikipedia.org/wiki/Bridge_(graph_theory) ) between the starting state S and the accepting state F. This can for example be done by removing an edge E and asking whether a path exists from S to E in the FA with E removed - repeat this for all edges.
  3. All edges that are bridges must be hit during an accepting run, and each edge corresponds to a letter of input.
  4. You may now generate the required words by connecting subsequent bridge edges end-to-end.

I think it follows from the algorithm construction that an FA (and not a DFA) suffices for this to work. Again, a proof would be nice but I think it should work:)

t.dubrownik
  • 12,464
  • 1
  • 17
  • 6
0

See the library Xeger which given a regular expression will give you all the possible strings that match.

You seem to only want to keep the common prefix of these strings (the part where you said to ignore optional operators) but if you do that you might capture stings that have that common prefix yet do not have the ending you want (such as "jumpy" in your example). If this is not a problem then just find the shortest string given by Xeger, assuming that optional operators occur only at the end of the regex.

mtanti
  • 794
  • 9
  • 25