How to write a Generic Log Parser

Question

We need to parse several log files and run some statistics on the log entries found (such as the number of occurrences of certain messages, spikes of occurrences, etc.). The problem is writing a log parser that handles several log formats and lets me add a new log format with very little work.

To make things easier, for now I'm only looking at logs that look similar to this:

[11/17/11 14:07:14:030 EST] MyXmlParser     E   Premature end of file

so each log entry contains a timestamp, an originator (of the log message), a level and a log message. One important detail is that a message may span more than one line (e.g. a stack trace). Another form a log entry could take is:

17-11-2011 14:07:14 ERROR    MyXmlParser   - Premature end of file

I'm looking for a good way to specify the log format, as well as the most suitable technology to implement the parser for it. I thought about regular expressions, but I think it will be tricky to handle situations such as multi-line messages (e.g. a stack trace).

Actually the task of writing a parser for a specific log format does not sound so easy itself when I consider the possibility of multi-line messages. How do you go about parsing those files?

Ideally I would be able to specify something like this as a log format:

[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE

or

%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE

Obviously I would have to assign the right converter to each field so that it is handled correctly (e.g. the timestamp).

Could anyone give me some good ideas on how to implement this in a robust and modular way (I'm using Java)?

http://stackoverflow.com/questions/465329/best-xml-format-for-log-events-in-terms-of-tool-support-for-data-mining-and-visu — ThomasRS, Dec 07 '11 at 15:51

score 3 · Answer 1 · answered Nov 28 '11 at 17:06

AWStats is a great log parser, open source, and you can do whatever you want with the resulting database that it generates.

Matt H

Thanks for the information about AWStats. However, I took a quick look at it and it didn't seem to be a general purpose log parser. Am I missing something? Also, even if it is, I think in the long term I would still like to be able to parse the logs myself because I will want flexibility to implement some of the features I already have in mind. – Mario Duarte Nov 28 '11 at 20:54

You can define almost any kind of log. The LogFormat directive lets you tell it what to look for and parse. – Matt H Nov 28 '11 at 20:58

score 2 · Answer 2 · answered Dec 09 '11 at 16:38

You can use a Scanner, for example, and some regexes. Here is a snippet of what I did to parse some complex logs:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DATE_PATTERN (a SimpleDateFormat pattern) and EventLog are not shown in this snippet.
private static final Pattern LINE_PATTERN = Pattern.compile(
  "(\\S+:)?(\\S+? \\S+?) \\S+? DEBUG \\S+? - DEMANDE_ID=(\\d+?) - listener (\\S+?) : (\\S+?)");

// Returns the parsed event, or null if the line does not match LINE_PATTERN.
public static EventLog parse(String line) throws ParseException {
    String demandeId;
    String listenerClass;
    long startTime;
    long endTime;

    SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
    Matcher matcher = LINE_PATTERN.matcher(line);
    if (matcher.matches()) {
        // Skip group 1 (the optional prefix); groups 2 to 5 are the interesting ones.
        int offset = matcher.groupCount() - 4;
        demandeId = matcher.group(2 + offset);
        listenerClass = matcher.group(3 + offset);
        long time = sdf.parse(matcher.group(1 + offset)).getTime();
        if ("starting".equals(matcher.group(4 + offset))) {
            startTime = time;
            endTime = -1; // -1 marks "not set"
        } else {
            startTime = -1;
            endTime = time;
        }
        return new EventLog(demandeId, listenerClass, startTime, endTime);
    }
    return null;
}

So, with regexes and groups, it works pretty well.
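The same regex-and-groups idea can be stretched to cover the format strings from the question. What follows is only a minimal sketch under stated assumptions, not a definitive implementation: the class name LogFormatParser and its Entry type are hypothetical, the format is assumed to contain each of the four placeholders exactly once, and any line that does not match the format is treated as a continuation (for example a stack trace line) of the previous entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compiles a format such as "[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE" into a regex. */
final class LogFormatParser {

    /** One parsed entry. The timestamp stays as text so a per-format converter can be applied later. */
    static final class Entry {
        final String timestamp;
        final String origin;
        final String level;
        final StringBuilder message;

        Entry(String timestamp, String origin, String level, String firstLine) {
            this.timestamp = timestamp;
            this.origin = origin;
            this.level = level;
            this.message = new StringBuilder(firstLine);
        }
    }

    private static final Pattern PLACEHOLDER = Pattern.compile("%(TIMESTAMP|ORIGIN|LEVEL|MESSAGE)");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private final Pattern linePattern;

    /** Assumes the format contains each of the four placeholders exactly once. */
    LogFormatParser(String format) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(format);
        int last = 0;
        while (m.find()) {
            appendLiteral(regex, format.substring(last, m.start()));
            String name = m.group(1);
            if (name.equals("MESSAGE")) {
                regex.append("(?<MESSAGE>.*)");      // rest of the line
            } else if (name.equals("TIMESTAMP")) {
                regex.append("(?<TIMESTAMP>.+?)");   // may contain spaces, so match lazily
            } else {
                regex.append("(?<").append(name).append(">\\S+)"); // a single token
            }
            last = m.end();
        }
        appendLiteral(regex, format.substring(last));
        linePattern = Pattern.compile(regex.toString());
    }

    /** Quotes literal text; any run of whitespace in the format matches one or more whitespace characters. */
    private static void appendLiteral(StringBuilder regex, String literal) {
        Matcher ws = WHITESPACE.matcher(literal);
        int last = 0;
        while (ws.find()) {
            if (ws.start() > last) {
                regex.append(Pattern.quote(literal.substring(last, ws.start())));
            }
            regex.append("\\s+");
            last = ws.end();
        }
        if (last < literal.length()) {
            regex.append(Pattern.quote(literal.substring(last)));
        }
    }

    /** Lines that do not match the format are appended to the previous entry's message. */
    List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : lines) {
            Matcher m = linePattern.matcher(line);
            if (m.matches()) {
                current = new Entry(m.group("TIMESTAMP"), m.group("ORIGIN"),
                                    m.group("LEVEL"), m.group("MESSAGE"));
                entries.add(current);
            } else if (current != null) {
                current.message.append('\n').append(line);
            }
            // Lines before the first recognizable entry are ignored.
        }
        return entries;
    }
}
```

Adding a new log format then comes down to constructing new LogFormatParser("%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE"). The lazy timestamp match is a heuristic: a continuation line that happens to fit the format would be mistaken for a new entry, so a stricter per-format timestamp regex is probably worth adding for real logs.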

score 1 · Accepted Answer · answered Oct 11 '13 at 13:47

I ended up not writing my own and using logstash.

Mario Duarte

score 1 · Answer 4 · answered Dec 13 '11 at 12:37

If you can (and with a good logging framework you should be able to), I recommend duplicating the logs in a parsable format. For example, with log4j, use an XMLLayout or something similar. It will be much easier to parse because you will then know the exact format of the logs.

You can do this almost transparently to the running application, purely through configuration. Consider using an asynchronous appender so that the extra logging does not disturb the running application too much.

Also, if the XMLLayout suits your needs, have a look at Apache Chainsaw.

score 1 · Answer 5 · answered Dec 23 '11 at 23:27

Log4j's LogFilePatternReceiver does exactly that...

This log entry: 17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file

can be parsed using the following logformat (assuming origin is the same as 'logger'), with the timestamp format given in Java's SimpleDateFormat syntax as dd-MM-yyyy kk:mm:ss:

TIMESTAMP LEVEL LOGGER - MESSAGE
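The timestamp pattern can be checked in isolation with plain SimpleDateFormat, independently of the receiver. This is only a small sketch: the TimestampConverter class is a hypothetical name, and the time zone is passed in explicitly because the log line itself does not carry one.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper that converts the timestamp text of a log entry into a Date. */
final class TimestampConverter {

    /**
     * Parses text with the given SimpleDateFormat pattern, e.g. "dd-MM-yyyy kk:mm:ss".
     * The zone is an assumption supplied by the caller, since the log line has none.
     */
    static Date parse(String text, String pattern, TimeZone zone) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(zone);
        format.setLenient(false); // reject impossible values such as month 13
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a valid timestamp: " + text, e);
        }
    }
}
```

For the entry above, TimestampConverter.parse("17-11-2011 14:07:14", "dd-MM-yyyy kk:mm:ss", TimeZone.getTimeZone("UTC")) yields a Date, while a malformed value raises an IllegalArgumentException.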

The timezone and the level in the other form are a little trickier... there is the ability to remap strings to levels (E to ERROR), but I don't know whether the timezone will quite work.

Try it out, check out the source, and play with support for it in the latest developer snapshot of Chainsaw:

http://people.apache.org/~sdeboy

score 0 · Answer 6 · answered Dec 08 '11 at 22:28

At work we rolled our own log parser (in Java) so we could filter the known stacktraces out of the production logs to identify new potential production problems. It uses regex and it's tightly coupled to our log4j log format.

We've also got a Python script that runs over the live production transaction logs and reports (to SiteScope, our infrastructure monitoring tool) when the count for particular errors is too high.
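The script itself is not shown here, but the kind of threshold check it performs could look roughly like this in Java. This is an illustrative sketch only: ErrorCountMonitor, its method names and the threshold are hypothetical, and the reporting step to a monitoring tool is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a "too many errors" check over already parsed messages. */
final class ErrorCountMonitor {

    /** Counts how often each message occurs. */
    static Map<String, Integer> countByMessage(List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : messages) {
            counts.merge(message, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns only the messages whose count is strictly greater than the threshold. */
    static Map<String, Integer> overThreshold(Map<String, Integer> counts, int threshold) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

Running countByMessage over the messages seen in a fixed time window and then overThreshold with a per-window limit gives the set of errors worth reporting for that window.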

While both are useful, they are awful to maintain, and I would recommend trying an open source parsing tool first, and resorting to writing your own only if necessary. Heck, I would even pay for a tool that did this ;)

score 0 · Answer 7 · answered Dec 12 '11 at 16:21

Maybe you could write a Log4j CustomAppender? For example as described here: http://mytechattempts.wordpress.com/2011/05/10/log4j-custom-memory-appender/

Your custom appender could use a database, or simple Java objects queried through JMX, to collect your statistics. It all depends on how much data needs to be persisted.
