
I'm using regular expressions to parse logs. So far I have been reading the file into a list of lines and iterating over it: if a line does not start with a timestamp, I append it to the current entry; otherwise I close the current entry and start a new one with that line. Once I have a complete log entry, I use another regular expression to parse it.

Scanning file

try {
    List<String> lines = Files.readAllLines(filepath);

    Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");
    Matcher matcher;
    String currentEntry = "";
    for(String line : lines) {
        matcher = pattern.matcher(line);
        // If this is a new entry, then wrap up the previous one and start again
        if ( matcher.lookingAt() ) {
            // If the previous entry was not empty
            if(!StringUtils.trimWhitespace(currentEntry).isEmpty()) {
                entries.add(new LogEntry(currentEntry));
            }

            // Clear the current entry
            currentEntry = "";
        }

        if (!currentEntry.trim().isEmpty())
            currentEntry += "\n";
        currentEntry += line;
    }
    // At the end, if we have one leftover entry, add it
    if (!currentEntry.isEmpty()) {
        entries.add(new LogEntry(currentEntry));
    }
}catch (Exception ex){
    return null;
}

Parsing entry

final private static String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final private static String levelRgx = "(?<level>(?>INFO|ERROR|WARN|TRACE|DEBUG|FATAL))";
final private static String classRgx = "\\[(?<class>[^]]+)\\]";
final private static String threadRgx = "\\[(?<thread>[^]]+)\\]";
final private static String textRgx = "(?<text>.*)";
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx + "$", Pattern.DOTALL);

public LogEntry(String logText) {

    try {
        Matcher matcher = PatternFullLog.matcher(logText);
        matcher.find();

        String dateStr = matcher.group("timestamp");
        timestamp = new DateLogLevel();
        timestamp.parseLogDate(dateStr);

        String levelStr = matcher.group("level");
        loglevel = LOG_LEVEL.valueOf(levelStr);
        String fullClassStr = matcher.group("class");

        String[] classNameArray = fullClassStr.split("\\.");
        framework = classNameArray[2];
        classname = classNameArray[classNameArray.length - 1];
        threadname = matcher.group("thread");
        logtext = matcher.group("text");
        notes = "";

    } catch (Exception ex) {
        throw ex;
    }
}

What I want to figure out

What I really want to do is read the whole file as a single string and then extract every entry with one regular expression in a single pass. My plan was to reuse the expression from the constructor, but make the log text group end at either EOF or the next log line, like this:

final String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final String levelRgx = "(?<level>(?>INFO|ERROR|WARN|TRACE|DEBUG|FATAL))";
final String classRgx = "\\[(?<class>[^]]+)\\]";
final String threadRgx = "\\[(?<thread>[^]]+)\\]";
final String textRgx = "(?<text>.*[^(\Z|\\d{4}\-\\d{2}\-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})"; // change to handle multiple lines
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx + "$", Pattern.DOTALL);

try {
    // Read the whole file into a single string
    String lines = readFile(filepath);

    // Match each complete entry with the combined pattern
    Matcher matcher = PatternFullLog.matcher(lines);
    while (matcher.find()) {
        String dateStr = matcher.group("timestamp");
        timestamp = new DateLogLevel();
        timestamp.parseLogDate(dateStr);

        String levelStr = matcher.group("level");
        loglevel = LOG_LEVEL.valueOf(levelStr);
        String fullClassStr = matcher.group("class");

        String[] classNameArray = fullClassStr.split("\\.");
        framework = classNameArray[2];
        classname = classNameArray[classNameArray.length - 1];
        threadname = matcher.group("thread");
        logtext = matcher.group("text");
        entries.add(
            new LogEntry(
                timestamp,
                loglevel,
                framework,
                threadname,
                logtext,
                ""/* Notes are empty when importing new file */));
    }
} catch (Exception ex) {
    return null;
}

The problem is that I can't get the last group (textRgx) to match across multiple lines up to either the next timestamp or the end of the file. Does anyone have any thoughts?

Sample Log Entries

2017-03-14 22:43:14,405 FATAL [org.springframework.web.context.support.XmlWebApplicationContext]-[localhost-startStop-1] Refreshing Root WebApplicationContext: startup date [Tue Mar 14 22:43:14 UTC 2017]; root of context hierarchy
2017-03-14 22:43:14,476 INFO  [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Loading XML bean definitions from Serv
2017-03-14 22:43:14,476 INFO  [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Here is a multiline
log entry with another entry after
2017-03-14 22:43:14,476 INFO  [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Here is a multiline
log entry with no entries after
clamport
  • aren't there tools for this? http://stackoverflow.com/questions/2590251/is-there-a-log-file-analyzer-for-log4j-files – Eugene Mar 30 '17 at 15:03
  • None of the tools that I see have all the features that I want unfortunately :( – clamport Mar 30 '17 at 15:14
  • pardon my ignorance, but what features exactly? this should have been implemented by someone. you will have a hard time supporting this (besides a big amount of work that you will have to do ) – Eugene Mar 30 '17 at 15:19
  • Specifically two things. I want to be able to correlate lines (ie, add a category or tab, and be able to select lines that I want to continue investigation) and I want to be able to note specific log lines. Preferably multi platform, but Mac if not – clamport Mar 30 '17 at 15:22
  • 1) You should escape a `]` inside a character class in Java regex - `[^]]` -> `"[^\\]]"`, 2) to match up to the next timestamp or end of file, use `"(?s)(?<text>.*?)(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}|\\Z)"`. – Wiktor Stribiżew Mar 30 '17 at 15:52

1 Answer


You need to define the patterns like this:

final static String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final static String levelRgx = "(?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)";
final static String classRgx = "\\[(?<class>[^\\]]+)]";
final static String threadRgx = "\\[(?<thread>[^\\]]+)]";
final static String textRgx = "(?<text>.*?)(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}|\\Z)";
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx, Pattern.DOTALL);

Then you can use it like this, where line holds the whole file contents:

Matcher matcher = PatternFullLog.matcher(line);
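To make the usage concrete, here is a minimal, self-contained sketch of looping over every match with find(). The class name LogRegexDemo, the parse helper, the "LEVEL|thread|text" summary format, and the shortened sample entries are all assumptions made for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern itself is the one defined above.
class LogRegexDemo {
    static final String TIMESTAMP = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";

    static final Pattern FULL_LOG = Pattern.compile(
        "(?<timestamp>" + TIMESTAMP + ") "
        + "(?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)\\s+"
        + "\\[(?<class>[^\\]]+)]-\\[(?<thread>[^\\]]+)]\\s+"
        + "(?<text>.*?)(?=" + TIMESTAMP + "|\\Z)",
        Pattern.DOTALL);

    // Returns one "LEVEL|thread|text" summary per matched entry
    // (an illustrative format, standing in for building LogEntry objects).
    static List<String> parse(String log) {
        List<String> out = new ArrayList<>();
        Matcher m = FULL_LOG.matcher(log);
        while (m.find()) {
            // trim() drops the line break that precedes the next timestamp
            out.add(m.group("level") + "|" + m.group("thread") + "|" + m.group("text").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String log =
            "2017-03-14 22:43:14,405 FATAL [org.example.A]-[main] first\n"
            + "2017-03-14 22:43:14,476 INFO  [org.example.B]-[main] multi\nline\n"
            + "2017-03-14 22:43:14,477 WARN  [org.example.C]-[worker-1] last\nentry";
        parse(log).forEach(System.out::println);
    }
}
```

Each call to find() resumes where the previous match ended, so the whole file is consumed in one pass without splitting it into lines first.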

See the Java demo

Here is what the pattern looks like:

(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)\s+\[(?<class>[^\]]+)]-\[(?<thread>[^\]]+)]\s+(?<text>.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}|\Z)

See the regex demo.

Some notes:

  • You had several issues with escaping symbols (a ] inside a character class should be escaped, and \- should have been replaced with -).
  • The pattern to match text up to the datetime or end of string is (?<text>.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}|\Z) where .*? matches any char (including line breaks, because of Pattern.DOTALL), 0+ occurrences, reluctantly, up to the first occurrence of the timestamp pattern (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) or end of string (\Z). The lookahead (?=...) only checks for the timestamp without consuming it, so the next match can start there.
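The effect of the reluctant quantifier can be checked with a small experiment. The sketch below uses a simplified pattern (timestamp plus text only) and a hypothetical helper, countEntries, both assumed purely for illustration: with .*? each entry stops at the next timestamp, while a greedy .* under DOTALL runs to the end of the input, where \Z satisfies the lookahead, so everything collapses into one match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing ".*?" with ".*" in the text group.
class ReluctantVsGreedy {
    static final String TS = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";

    // Counts how many entries are found when the text group uses the given quantifier.
    static int countEntries(String textQuantifier, String log) {
        Pattern p = Pattern.compile(
            "(?<timestamp>" + TS + ")\\s+(?<text>" + textQuantifier + ")(?=" + TS + "|\\Z)",
            Pattern.DOTALL);
        Matcher m = p.matcher(log);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String log =
            "2017-03-14 22:43:14,405 first entry\n"
            + "2017-03-14 22:43:14,476 second entry\nwith a second line\n"
            + "2017-03-14 22:43:14,477 third entry";
        System.out.println("reluctant: " + countEntries(".*?", log)); // prints "reluctant: 3"
        System.out.println("greedy:    " + countEntries(".*", log));  // prints "greedy:    1"
    }
}
```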
Wiktor Stribiżew
  • Thanks haha, the `\-` were artifacts from an attempt where I put them in []. Thanks for the great answer! – clamport Mar 30 '17 at 17:05