I'm trying to efficiently extract static strings (strings that MUST be present in the text for a given regular expression to match). I've been able to do it in the simplest cases, but I'm looking for a more robust solution.
A regex such as the one below
"fox jump(ed|ing|s)"
would give us
"fox,jumped,jumping,jumps"
Another example is
"fox jump(ed|ing|s)?"
which would give us
"fox,jump"
because of the optional "?" operator on the group.
The algorithm I have is overly simple for now. It starts from the end of the regex, removes groups or single characters followed by the operators "*" or "?", and "explodes" grouped OR alternations "(|)". This has worked quite well, but it doesn't take the full regex syntax into account. You can think of it as a kind of minimal-set-generating process for a regex (the minimal set of strings that the regex can "generate"/must match).
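Roughly, a sketch of that naive approach looks like this (Python just for illustration; extract_keywords is my own placeholder name, and it only handles flat patterns shaped like the examples above, with no nesting, escapes, or character classes):

```python
import re

def extract_keywords(pattern):
    """Naive sketch: strip optional tails, explode OR groups, split into words."""
    # Working from the end, drop a group or single character that is made
    # optional by "?" or "*" (it is not required for a match).
    while True:
        stripped = re.sub(r'(\([^()]*\)|\\.|[^()*?])[*?]$', '', pattern)
        if stripped == pattern:
            break
        pattern = stripped

    # "Explode" any remaining (a|b|c) alternation into every literal variant.
    variants = [pattern]
    group = re.compile(r'\(([^()|]*(?:\|[^()|]*)+)\)')
    while any(group.search(v) for v in variants):
        exploded = []
        for v in variants:
            m = group.search(v)
            if m is None:
                exploded.append(v)
            else:
                for alt in m.group(1).split('|'):
                    exploded.append(v[:m.start()] + alt + v[m.end():])
        variants = exploded

    # Split on whitespace so each required literal word becomes a keyword.
    keywords = []
    for v in variants:
        for word in v.split():
            if word not in keywords:
                keywords.append(word)
    return keywords

print(extract_keywords("fox jump(ed|ing|s)"))   # ['fox', 'jumped', 'jumping', 'jumps']
print(extract_keywords("fox jump(ed|ing|s)?"))  # ['fox', 'jump']
```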
WHY? I'm trying to match a lot of text against a large set of regexes. If I can get a list of "required" keywords for each regex, I can do a quick text search for those keywords to filter down to the regexes I care about (ignoring the ones that are guaranteed not to match, or even skipping the text entirely and running no regexes at all, because we are guaranteed to have no match within our set of regexes). I can organize this set of keywords in an efficient data structure (binary search/trie/Aho-Corasick) to filter the set of regexes before I even run the text through the finite automata. There are extremely fast string-matching algorithms I can run as a filtering stage before attempting any regular expression, and I've been able to increase throughput many-fold with this simple process.
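To make the idea concrete, here is a minimal sketch of that filtering stage, building on the extract_keywords sketch above (the pattern list is made up, and the per-keyword "in" checks are just a stand-in for a real Aho-Corasick or trie pass over the text):

```python
import re

patterns = [
    "fox jump(ed|ing|s)",
    "quick brown (fox|bear)",
]

# Index each required keyword back to the compiled regexes it came from.
keyword_to_regexes = {}
for p in patterns:
    compiled = re.compile(p)
    for kw in extract_keywords(p):          # sketch from above
        keyword_to_regexes.setdefault(kw, []).append(compiled)

def match_text(text):
    # Stage 1: cheap substring scan to pick candidate regexes.
    # (A real version would do a single Aho-Corasick pass over the text.)
    candidates = []
    for kw, regexes in keyword_to_regexes.items():
        if kw in text:
            candidates.extend(r for r in regexes if r not in candidates)
    # Stage 2: run the full regexes only on the surviving candidates.
    return [r.pattern for r in candidates if r.search(text)]

print(match_text("the quick brown fox jumped over the fence"))
print(match_text("no keywords here"))   # empty: no regex is ever executed
```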