We have written a system to analyse log messages from a large network. The system takes log messages from many different network elements and matches them against regular expressions. For example, a user may have written two rules:
^cron/script\.sh.*
.*script\.sh [0-9]+$
In this case, only logs that match the given patterns will be selected. The reason for the filtering is that there can be a very large number of log messages, up to 1 GB per day.
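Roughly, the selection step works like this (a minimal sketch in Python; the rule list and function name are illustrations, not our actual code):

import re

# Compiled forms of the two example rules above.
RULES = [
    re.compile(r"^cron/script\.sh.*"),
    re.compile(r".*script\.sh [0-9]+$"),
]

def select_logs(lines):
    # Keep only the log lines that match at least one user rule.
    return [line for line in lines if any(r.search(line) for r in RULES)]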
Now the main part of my question. Since there are lots of network elements, of several different types, and every one of them has different parameters in its path... Is there any way to automatically generate a set of regexes that will somehow group the logs? The system can learn from historical data, e.g. from the last week. The generated regex does not have to be very accurate; it is only meant as a hint for the user, who can then add such a new rule to the system.
I was thinking about unsupervised machine learning to divide the input into groups and then find a proper regex for each group. Is there any other way, maybe faster or better? And, last but not least, how do I find a regex that matches all the strings in an obtained group? (Non-trivial, so .* is not the answer.)
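To illustrate the kind of pipeline I was imagining (only a sketch, assuming scikit-learn is available; the character n-gram features and the fixed cluster count are placeholder choices I have not validated):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_logs(lines, n_clusters=10):
    # Represent each log line by character n-gram tf-idf features,
    # then group the lines with plain k-means.
    features = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(lines)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    groups = {}
    for line, label in zip(lines, labels):
        groups.setdefault(label, []).append(line)
    return list(groups.values())

Each returned group would then still need a regex that describes it, which is the second half of the question.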
Edit: After some thinking I'll try to simplify the problem. Suppose I have already grouped the logs. I'd like to find the largest substrings (at least one, at most three) common to all the strings in the set. For example:
Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15
Obtained substrings:
/script
.sh
Now I can build a simple regex by concatenating these substrings with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. That seems to be a simpler solution.
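A greedy sketch of this simplified version (my own rough attempt, not a known algorithm): take the longest substring common to all strings, then recurse on the left and right remainders so the extracted parts keep their order.

import re

def common_parts(strings, budget=3, min_len=2):
    # Longest substring of at least min_len characters contained in
    # every string, or None if there is no such substring.
    def longest_common(strs):
        ref = min(strs, key=len)
        for size in range(len(ref), min_len - 1, -1):
            for start in range(len(ref) - size + 1):
                cand = ref[start:start + size]
                if all(cand in s for s in strs):
                    return cand
        return None

    def recurse(strs, budget):
        if budget == 0:
            return []
        part = longest_common(strs)
        if part is None:
            return []
        # Split every string at the first occurrence of the part and
        # spend the remaining budget on the left and right remainders.
        left = recurse([s[:s.index(part)] for s in strs], budget - 1)
        right = recurse([s[s.index(part) + len(part):] for s in strs],
                        budget - 1 - len(left))
        return left + [part] + right

    return recurse(strings, budget)

def build_regex(parts):
    return ".*?" + ".*?".join("(%s)" % re.escape(p) for p in parts) + ".*?"

On the example set above this finds ['n/script', '.sh '] rather than the hand-picked '/script' (because 'n/script' happens to be a longer substring shared by "cron/script" and "bin/script"), and build_regex turns that into .*?(n/script).*?(\.sh ).*?.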