We need to parse several log files and run some statistics on the log entries found (things such as the number of occurrences of certain messages, spikes of occurrences, etc.). The problem is writing a log parser that handles several log formats and lets me add a new log format with very little work.
To make things easier for now I'm only looking at logs that will basically look similar to this:
[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file
so each log entry will contain a timestamp, originator (of the log message), level and log message. One important detail is that a message may have more than one line (e.g. stacktrace).
Another instance of the log entry could be:
17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
I'm looking for a good way to specify the log format as well as the most adequate technology to implement the parser for it. I thought about regular expressions, but I think it will be tricky to handle situations such as the multi-line message (e.g. stacktrace).
Actually, writing a parser for even one specific log format does not sound easy once I consider the possibility of multi-line messages. How do you go about parsing those files?
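To make the multi-line concern concrete, here is a minimal sketch of one possible approach, not a definitive implementation: treat any line that matches an "entry start" pattern as the beginning of a new entry, and append every other line (such as stack trace lines) to the entry before it. It assumes every entry begins with a recognizable prefix such as the timestamp. The class name `EntryGrouper` and the parameter `entryStart` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: groups raw lines into complete log entries.
class EntryGrouper {

    // A line that begins with entryStart opens a new entry;
    // any other line is treated as a continuation of the current entry.
    static List<String> group(List<String> lines, Pattern entryStart) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (entryStart.matcher(line).lookingAt()) {
                if (current != null) {
                    entries.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
            }
            // Lines seen before the first recognizable entry are skipped.
        }
        if (current != null) {
            entries.add(current.toString());
        }
        return entries;
    }
}
```

For the first sample format, an entry-start pattern such as `Pattern.compile("\\[\\d{1,2}/\\d{1,2}/\\d{2} ")` might be enough; each grouped entry can then be handed to a single-entry parser as one string.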
Ideally I would be able to specify something like this as a log format:
[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE
or
%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE
Obviously I would have to assign the right converter to each field so that it is handled correctly (e.g. the timestamp).
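As a rough illustration of the format idea, the sketch below compiles such a template into a regular expression with one named group per placeholder. It is only one possible design under stated assumptions: the class name `LogFormat`, the per-field patterns (lazy match for `TIMESTAMP`, rest of the entry for `MESSAGE`, a single whitespace-free token for anything else) and the `null` return for non-matching input are all choices made for this example. It also assumes each placeholder appears at most once in a template.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: turns a template like "[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE"
// into a regex and extracts the fields of one (possibly multi-line) entry.
class LogFormat {
    private static final Pattern PLACEHOLDER = Pattern.compile("%([A-Z]+)");

    private final Pattern pattern;
    private final List<String> fields = new ArrayList<>();

    LogFormat(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is matched verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            String name = m.group(1);
            fields.add(name);
            regex.append("(?<").append(name).append('>')
                 .append(fieldRegex(name)).append(')');
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));
        // DOTALL lets MESSAGE span several lines (e.g. a stack trace).
        pattern = Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Assumed per-field patterns; a real parser would likely make these configurable.
    private static String fieldRegex(String name) {
        switch (name) {
            case "TIMESTAMP": return ".+?";   // may contain spaces, so match lazily
            case "MESSAGE":   return ".*";    // everything up to the end of the entry
            default:          return "\\S+";  // a single token without whitespace
        }
    }

    // Returns field name -> raw text, or null if the entry does not match the format.
    Map<String, String> parse(String entry) {
        Matcher m = pattern.matcher(entry);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (String field : fields) {
            values.put(field, m.group(field));
        }
        return values;
    }
}
```

With this sketch, `new LogFormat("%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE").parse(entry)` yields the raw field strings; a converter per field (for instance a date parser for the timestamp) could then be applied to those values in a separate step.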
Could anyone give me some good ideas on how to implement this in a robust and modular way (I'm using Java)?