1

I have a huge log file with different types of string rows, and I need to extract data in a "smart" way from these.

Sample snippet:

2011-03-05 node32_three INFO stack trace, at empty string asfa 11120023
--- - MON 23 02 2011 ERROR stack trace NONE      

For instance, what is the best way to extract the date from each row, independent of date format?

Cody Gray - on strike
carloscloud
  • Do you mean 'extract the dates'? Your example has two dates. – pavium Apr 13 '11 at 12:52
  • Do I understand correctly that your huge log file contains different types of rows, in which the date may appear in different formats? If so, regex may not be a good solution. – MarcoS Apr 13 '11 at 12:55
  • @heykalrm: I edited your question to show individual lines in your example, but I'm not sure I got it right. Please check it and verify that the line split is in the correct place. – Jim Mischel Apr 13 '11 at 12:58
  • @MarcoS Yes, the dates may appear in different formats. If not regex, what's your solution? – carloscloud Apr 18 '11 at 06:51
  • I gave an answer with an alternative approach: use regex to separate the date strings and Joda-Time to parse them. See my answer. – MarcoS Apr 18 '11 at 10:38

2 Answers

3

You could make a regex for different formats like so:

 (fmt1)|(fmt2)|....

Where fmt1, fmt2, etc. are the individual regexes. For your example:

(20\d\d-[01]\d-[0123]\d)|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\d [01]\d 20\d\d)

Note that, to reduce the chance of matching arbitrary numbers, I restricted the year, month and day digits accordingly. For example, a day number cannot start with 4, nor can a month number start with 2.

This gives the following pseudocode (the pattern string and the loop body are left elided):

// remember that you need to double each backslash when writing the
// pattern as a Java string literal
Pattern p = Pattern.compile("...");    // compile once, reuse for every line
for (String s : lines) {               // lines: the input lines, read however you like
    Matcher m = p.matcher(s);
    if (m.find()) {
        String d = m.group();    // d is the substring that matched
        ....
    }
}

Each individual date pattern is written in () to make it possible to find out what format we had, like so:

        int fmt = 0;
        // each (fmt) is a group, numbered starting with 1 from left to right;
        // m.groupCount() equals the number of formats if inner groups do not capture
        for (int i = 1; fmt == 0 && i <= m.groupCount(); i++)
            if (m.group(i) != null) fmt = i;

For this to work, inner (regex) groups must be written (?:regex) so that they do not count as capture groups; see the updated example.
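To make the group-numbering idea concrete, here is a minimal runnable sketch. The class name LogDateFinder and the helper names formatOf and describe are made up for illustration; the pattern is the two-format example from above, and you would extend it with your own formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative, not from any library.
class LogDateFinder {
    // One capturing group per date format. The weekday alternation uses (?:...)
    // so that it does not shift the group numbers.
    static final Pattern DATE = Pattern.compile(
        "(20\\d\\d-[01]\\d-[0123]\\d)"
        + "|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\\d [01]\\d 20\\d\\d)");

    // Returns the 1-based number of the format group that matched, or 0 if none did.
    static int formatOf(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return i;
            }
        }
        return 0;
    }

    // Returns "<format number>:<matched text>" for the first date in the line,
    // or null if the line contains no date in a known format.
    static String describe(String line) {
        Matcher m = DATE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return formatOf(m) + ":" + m.group();
    }
}
```

Calling LogDateFinder.describe on each log line tells you both which format was found and the matched text, so you can hand the text to a parser that knows that format.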

Ingo
1

If you use Java, you may want to have a look at Joda-Time. Also, read this question and related answers. I think Joda-Time's DateTimeFormat should give you all the flexibility that you need to parse the various date/time formats in your log file.

A quick example:

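import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

String dateString = "2011-04-18 10:41:33";
DateTimeFormatter formatter = 
  DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
DateTime dateTime = formatter.parseDateTime(dateString);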

Just define a String[] with the formats of your dates/times, and pass each element to DateTimeFormat to get the corresponding DateTimeFormatter. You can use regex just to separate the date strings from the other stuff in the log lines, and then use the various DateTimeFormatters to try to parse them.
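The "try each formatter in turn" idea could look roughly like the sketch below. Note the swap: it uses the JDK's own java.time classes (Java 9 or later) instead of Joda-Time, so that it runs without an extra dependency; the structure is the same with Joda-Time formatters. The class name MultiFormatDateParser and the two patterns are assumptions for illustration. It also assumes the regex step has already stripped the weekday, because the sample "MON 23 02 2011" would not parse with a weekday field (23 February 2011 was a Wednesday).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class; uses java.time rather than Joda-Time.
class MultiFormatDateParser {
    // One formatter per known date format; add more entries as needed.
    static final List<DateTimeFormatter> FORMATTERS = List.of(
        DateTimeFormatter.ofPattern("yyyy-MM-dd"),
        DateTimeFormatter.ofPattern("dd MM yyyy"));

    // Tries each formatter in order and returns the first successful parse,
    // or an empty Optional if no formatter accepts the text.
    static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return Optional.of(LocalDate.parse(text, formatter));
            } catch (DateTimeParseException e) {
                // This format did not fit; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Besides turning the string into a real date, the parser also rejects strings that the digit-restricted regex lets through but that are not valid dates, such as a month of 19.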

Community
MarcoS