19

I was posed an interesting question by a colleague about an operational pain point we currently have, and I'm curious if there's anything out there (utility/library/algorithm) that might help automate it.

Say you have a list of literal values (in our case, they are URLs). What we want to do is, based on this list, come up with a single regex that matches all of those literal items.

So, if my list is:

http://www.example.com
http://www.example.com/subdir
http://foo.example.com

The simplest answer is

^(http://www.example.com|http://www.example.com/subdir|http://foo.example.com)$

but this gets large for lots of data, and we have a length limit we're trying to stay under.

Currently we write the regexes by hand, but this doesn't scale well, nor is it a great use of anyone's time. Is there a more automated way of decomposing the source data to come up with a length-optimal regex that matches all of the source values?

Joe
  • Trivial reduction: "^.*$" matches all the source values. Perhaps you meant one that *only* matches the specified inputs? – Jerry Coffin Jul 07 '10 at 15:13
  • Note the mangled syntax highlighting. – Svante Jul 07 '10 at 16:01
  • Do you want to match all *and only* those strings? Or find a general regexp (like `*.abc.*`) that matches them all, but possibly others as well? – Mau Jul 08 '10 at 09:56
  • Possible duplicate of [Is it possible for a computer to "learn" a regular expression by user-provided examples?](https://stackoverflow.com/questions/616292/is-it-possible-for-a-computer-to-learn-a-regular-expression-by-user-provided-e) – Anderson Green Feb 12 '18 at 06:26

7 Answers

16

The Aho-Corasick matching algorithm constructs a finite automaton to match multiple strings at once. You could convert the automaton to its equivalent regex, but it is simpler to use the automaton directly (this is what the algorithm does).
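
As a rough illustration, here is a minimal sketch using the third-party pyahocorasick package (the library choice is an assumption; the answer doesn't prescribe one):

import ahocorasick

urls = [
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
]

automaton = ahocorasick.Automaton()
for url in urls:
    automaton.add_word(url, url)

# Before make_automaton() is called, the Automaton behaves as a plain
# trie, which already gives exact whole-string membership tests:
print(automaton.exists("http://foo.example.com"))  # True
print(automaton.exists("http://bar.example.com"))  # False

# make_automaton() adds the Aho-Corasick failure links, which lets you
# scan a longer text for occurrences of any of the stored strings:
automaton.make_automaton()
for end_index, found in automaton.iter("see http://www.example.com/subdir for details"):
    print(end_index, found)  # reports both the bare domain and the /subdir URL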

Rafał Dowgird
8

I was searching for exactly this today. I didn't find anything, so I created a tool: kemio.com.ar/tools/lst-trie-re.php

You put a list on the right side, submit it, and get the regexp on the left.

I tried it with a 6 KB list of words, and it produced a 4 KB regexp (which I put in a JS file) like: var re=new RegExp(/..../,"mib");

Please don't abuse it.
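
For reference, here is a rough Python sketch of the same trie idea: build a character trie from the word list, then emit a pattern that shares common prefixes (an illustration, not the tool's actual code):

import re

def trie_to_regex(words):
    # Build a character trie; the '' key marks the end of a stored word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True

    def emit(node):
        end = '' in node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in node.items() if ch != '']
        if not branches:          # leaf: a word ends here, nothing follows
            return ''
        body = branches[0] if len(branches) == 1 else '(?:' + '|'.join(branches) + ')'
        # If a word also ends at this node, the remainder is optional.
        return '(?:' + body + ')?' if end else body

    return '^' + emit(trie) + '$'

print(trie_to_regex(['test', 'test22']))  # ^test(?:22)?$

Because the trie merges common prefixes before the pattern is emitted, "test" and "test22" collapse to test(?:22)? rather than overlapping alternatives, which also avoids the bug mentioned in the comments below.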

ESL
  • Nice, but I found a bug: try adding "test" and "test22". It answers "test2" instead of "test(22)?" – Séverin Apr 11 '18 at 09:59
  • Thanks! I will check it. I didn't remember the existence of this script, haha. – ESL Apr 17 '18 at 03:04
  • As it was written in PHP very quickly (not very maintainable) and has some bugs, I currently just recommend using another one, like this: https://github.com/alexeld/regex-trie – ESL Jun 09 '18 at 02:51
3

The Emacs utility function regexp-opt (source code) does not do exactly what you want (it only works on fixed strings), but it might be a useful starting point.

zwol
2

If you want to compare against all the strings in a set and only against those, use a trie, a compressed trie, or, even better, a directed acyclic word graph (DAWG). The latter should be particularly efficient for URLs, IMO.

You would have to abandon regexps though.
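
A minimal dict-based trie sketch in Python, just to illustrate the exact-match idea (a DAWG would additionally share common suffixes):

def build_trie(words):
    # Nested dicts; the '' key marks the end of a stored word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True
    return trie

def contains(trie, word):
    # Walk the trie one character at a time; succeed only if we land
    # on a node where a stored word actually ends.
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '' in node

urls = ['http://www.example.com', 'http://www.example.com/subdir', 'http://foo.example.com']
trie = build_trie(urls)
print(contains(trie, 'http://www.example.com'))  # True
print(contains(trie, 'http://www.example.org'))  # False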

Mau
1

I think it would make sense to take a step back and think about what you're doing, and why.

To match all those URLs, only those URLs and no other, you don't need a regexp; you can probably get acceptable performance from doing exact string comparisons over each item in your list of URLs.

If you do need regexps, then what are the variable differences you're trying to accommodate? I.e., which part of the input must match verbatim, and where is there wiggle room?

If you really do want to use a regexp to match a fixed list of strings, perhaps for performance reasons, then it should be simple enough to write a method that glues all your input strings together as alternatives, as in your example. The state machine doing regexp matching behind the scenes is quite clever and will not run more slowly if your match alternatives have common (and thus possibly redundant) substrings.
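
For instance, a short Python sketch of both options (illustrative only; nothing here is specific to the question's environment):

import re

urls = [
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
]

# Option 1: exact string comparison, no regex needed.
url_set = set(urls)
print("http://foo.example.com" in url_set)  # True

# Option 2: glue the escaped literals together as alternatives.
# re.escape() ensures '.' and other metacharacters match verbatim.
pattern = re.compile("^(?:" + "|".join(map(re.escape, urls)) + ")$")
print(bool(pattern.match("http://www.example.com/subdir")))  # True
print(bool(pattern.match("http://bar.example.com")))         # False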

Carl Smotricz
  • There are some system limits that are in the way right now of just gluing the regexes together, which is where the length limit comes from. At the moment, it would have to be regex, since there are use cases where "real" regexes (and not just literals) would be desired for matching. And, instead of a handful of URLs, blow this out to millions of URLs (total) across tens of thousands of groups. – Joe Jul 07 '10 at 16:31
1

Taking the cue from the other two answers: if all you need to match is the strings supplied, you are probably better off doing a straight string match (slow) or constructing a simple FSM that matches those strings (fast).

A regex engine actually builds an FSM and then matches your input against it, so if the inputs come from a previously known set, it is possible, and often easier, to build the FSM yourself instead of trying to auto-generate a regex.

Aho-Corasick has already been suggested. It is fast, but can be tricky to implement. How about putting all the strings in a trie and then querying that instead (since you are matching entire strings, not searching for substrings)?

MAK
0

An easy way to do this is to use the Python hachoir_regex module:

import hachoir_regex
from functools import reduce  # built in on Python 2; needs this import on Python 3

urls = ['http://www.example.com','http://www.example.com/subdir','http://foo.example.com']
as_regex = [hachoir_regex.parse(url) for url in urls]
print(reduce(lambda x, y: x | y, as_regex))

creates the simplified regular expression

http://(www.example.com(|/subdir)|foo.example.com)

The code first creates a simple regex type for each URL, then combines these with | (alternation) in the reduce step.

serv-inc