8

We have written a system to analyse log messages from a large network. The system takes log messages from many different network elements and analyses them with regular expressions. For example, a user may have written two rules:

^cron/script\.sh.*
.*script\.sh [0-9]+$

In this case only logs that match the given patterns will be selected. The reason for the filtering is that there can be a really large volume of log messages, up to 1 GB per day.
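
A minimal sketch of what this filtering stage might look like in Python (the two rules are the ones above; the file name is hypothetical):

    import re

    # The two user-written rules from the example above.
    rules = [
        re.compile(r"^cron/script\.sh.*"),
        re.compile(r".*script\.sh [0-9]+$"),
    ]

    def selected(lines):
        """Yield only the log lines that match at least one rule."""
        for line in lines:
            if any(rule.search(line.rstrip("\n")) for rule in rules):
                yield line

    # Hypothetical usage: filter one day's log file.
    with open("elements.log") as f:
        for line in selected(f):
            print(line, end="")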

Now, the main part of my question: since there are lots of network elements, of several types, and each of them has different parameters in its paths... is there any way to automatically generate a set of regexes that will somehow group the logs? The system could learn from historical data, e.g. from the last week. The generated regexes need not be very accurate; they are meant as hints for the user to add new rules to the system.

I was thinking of unsupervised machine learning to divide the input into groups and then finding a proper regex for each group. Is there any other way, maybe faster or better? And, last but not least, how does one find a regex that matches all the strings in an obtained group? (Non-trivially, so .* is not the answer.)
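
For comparison, one cheap grouping heuristic that stops short of machine learning: reduce each line to a coarse "shape" by replacing variable-looking parts with placeholders, and group lines whose shapes are identical. A rough sketch; the normalisation rules here are my own assumptions, not part of the question:

    import re
    from collections import defaultdict

    def shape(line):
        """Replace variable-looking parts with placeholders so lines
        differing only in parameters end up with the same key."""
        s = re.sub(r"\b[0-9a-fA-F]{8,}\b", "<HEX>", line)
        s = re.sub(r"[0-9]+", "<NUM>", s)
        return s

    def group_logs(lines):
        """Group log lines by shape; each group is a candidate for a rule."""
        groups = defaultdict(list)
        for line in lines:
            groups[shape(line)].append(line)
        return groups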


Edit: After some thinking, I'll try to simplify the problem. Suppose I have already grouped the logs. I'd like to find the (at most three, at least one) largest substrings common to all the strings in the set. For example:

Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15

Common substrings obtained:
/script
.sh 

Now I can build a simple regex by concatenating these substrings with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. This seems to be a simpler formulation of the problem.
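
A small sketch of that concatenation step in Python; re.escape takes care of metacharacters such as the dot in .sh:

    import re

    def build_regex(substrings):
        """Concatenate the common substrings with non-greedy gaps."""
        return ".*?" + ".*?".join("(%s)" % re.escape(s) for s in substrings) + ".*?"

    pattern = build_regex(["/script", ".sh "])
    for line in ["cron/script1.sh -abc 1243 all",
                 "cron/script2.sh 1",
                 "bin/script1.sh -asdf 15"]:
        assert re.fullmatch(pattern, line)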

Archie
  • Tricky question. Probably quite non-trivial, given the fact that a tool written by one of the leading experts on regular expressions for exactly this purpose still needs a lot of user interaction: see http://www.regexmagic.com – Tim Pietzcker Oct 06 '11 at 11:21
  • There is an easy proof that for any finite set of strings there are infinitely many regular expressions that would match them, so yes it *is* a non-trivial task. – hamstergene Oct 06 '11 at 11:47
  • I believe there may be some heuristic that allows finding the regex. It does not have to be a full regex; I think just finding a set of common substrings and then concatenating them with `.*?` would be enough for starters. – Archie Oct 06 '11 at 11:52
  • This problem isn't well defined. The most restrictive regex matching words w1, w2, ..., wN is w1 + w2 + ... + wN, and the least restrictive is Σ*. You want one in the middle? Have your pick. – Patrick87 Oct 06 '11 at 12:29
  • Possible duplicate of [Is it possible for a computer to "learn" a regular expression by user-provided examples?](https://stackoverflow.com/questions/616292/is-it-possible-for-a-computer-to-learn-a-regular-expression-by-user-provided-e) – Anderson Green Feb 10 '18 at 20:12

4 Answers

8

You could try the tool hosted at this site: http://regex.inginf.units.it/

This tool automatically generates a regex from a set of examples, so it should be perfect for your use case. The website also describes in detail how it works (it is based on genetic programming).

Marco Mauri
4

OK, we'll try to break this down into manageable steps.

  1. For each substring w of s1, in order of non-increasing length,
  2.  assume w is a common substring of all the other strings;
  3.  for each other string sN,
  4.   if w is not a substring of sN, disprove the assumption and break;
  5.  if the assumption held, save w;
  6.  if you've found three w that work, break.
  7. You have now recorded between 0 and 3 w that work.

Note that not all sets of strings are guaranteed to have common substrings (other than the empty string). In the worst case, assume s1 is the longest string. There are O(n^2) substrings of s1 (where |s1| = n), and it takes O(n) to compare each one to each of the m other strings... so the asymptotic complexity is, I believe, O(n^2 * nm). Even though the algorithm is naive, this should be pretty manageable (polynomial, after all).

The transformation to e.g. C code should be straightforward: use a sliding window with a decrementing length loop to get the substrings of s1, and then use linear searches to find matches in the other strings.
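
As a sketch of that transformation, in Python rather than C for brevity (collecting up to three w, longest first, exactly as in the steps above):

    def common_substrings(strings, limit=3):
        """Steps 1-7 above, naively: enumerate substrings of strings[0]
        from longest to shortest, keep those found in every other string."""
        s1, others = strings[0], strings[1:]
        found, tried = [], set()
        n = len(s1)
        for length in range(n, 0, -1):          # non-increasing length
            for i in range(n - length + 1):     # sliding window over s1
                w = s1[i:i + length]
                if w in tried:
                    continue
                tried.add(w)
                if all(w in s for s in others): # linear search per string
                    found.append(w)
                    if len(found) == limit:
                        return found
        return found

    # The example strings from the question; note that, as specified,
    # shorter pieces of an already-found w also qualify.
    print(common_substrings(["cron/script1.sh -abc 1243 all",
                             "cron/script2.sh 1",
                             "bin/script1.sh -asdf 15"]))
    # -> ['n/script', 'n/scrip', '/script']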

I'm sure there are smarter, asymptotically better ways of doing this, but any algorithm will have to look at all characters in all strings, so you cannot do better than O(nm) in any case.

Patrick87
  • The problem that needs to be solved here is called ["induction of regular languages."](https://en.wikipedia.org/wiki/Induction_of_regular_languages) – Anderson Green Jul 02 '22 at 19:51
0

Naively, you could construct a regex pattern for a given set of strings by joining all the strings with "or" (denoted by a pipe, |). If you want a shorter, more optimized pattern, you could create a trie from the set of strings and then construct the regex from that; take a look at trieregex, for example.
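
A sketch of the naive variant in Python; re.escape keeps metacharacters in the strings literal, and a trie-based tool such as trieregex would produce a more compact pattern from the same input:

    import re

    def alternation(strings):
        """Exact-match pattern: one escaped alternative per input string."""
        return "|".join(re.escape(s) for s in strings)

    pattern = re.compile(alternation(["cron/script1.sh", "cron/script2.sh"]))
    assert pattern.fullmatch("cron/script2.sh")
    assert not pattern.fullmatch("cron/script3.sh")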

Mr.
-1

It seems to me that you're trying to do something that regex does not do. A regex recognizes tokens from its input, nothing more. It is up to the rest of the program to determine where the input comes from, how it's grouped, and what to do with the tokens it gets.

So, let's start from the beginning. You have an input stream consisting of log events from various different network locations. Abstracting this, we get a single text stream out. See this website for examples of how to do this in a simplified way with Python.

Now that we have text, we want to break it into logical groups. This is what your regexes are for, though we really should be using a full-blown lexer. These tokens go to the parser section of the code.

The parser determines what to do with the tokens we've collected. Now we analyze them, starting with something simple, such as adding each unique "name" token to a list along with its count of occurrences. If this number is larger than, say, 50, we output the token.

I've glossed over the gritty details here. Primarily, you will have to determine a "language" for your tokenizing. For instance, if you have log strings like this: ipOfComputer /full/path/to/script.sh args then your tokenizer would need to understand these bits.
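
A rough sketch of that tokenize-and-count idea in Python, for log lines shaped like the example; the token pattern and the threshold of 50 are illustrations, not a real grammar:

    import re
    from collections import Counter

    # One capture per field: ipOfComputer /full/path/to/script.sh args
    LINE = re.compile(r"(?P<ip>\S+)\s+(?P<script>\S+)\s*(?P<args>.*)")

    def frequent_scripts(lines, threshold=50):
        """Count each unique script-path token and report those
        occurring more than `threshold` times."""
        counts = Counter()
        for line in lines:
            m = LINE.match(line)
            if m:
                counts[m.group("script")] += 1
        return [tok for tok, n in counts.items() if n > threshold]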

Read the book lex & yacc for an excellent introduction to the various topics involved. Note that all this will be independent of your users' regex rules: those will run over the entire stream as well, doing their own thing. The output from this program will become simple regexes to add to the config file.

Spencer Rathbun