I need a regular expression in Python that splits the following sample into individual log entries. I'm using the date to identify the start of each entry, but my patterns only capture a single line, from the date to the end of that first line, and miss the stack trace entirely. I want the whole entry because the log contains a lot of repeated messages, and I'd like to filter out the repeats and reduce it to a handful of unique entries. Once I've identified an entry, I'd also like to remove anything unique about it, such as the date and timestamp, so that a comparison function can flag it as a duplicate. I've tried positive lookaheads and the multiline flag, but to no avail. Does anyone know how to do this?

Some regular expressions I've tried

^\d{4}-\d{2}-\d{2}.*\(.*\)$ // it matches single line date to parenthesis
^(\d{4}-\d{2}-\d{2}|\s|).*\)$ // matches single line with tabs - not much better
^\d{4}-\d{2}-\d{2}.*(?=\d{4}-\d{2}-\d{2}) // positive lookahead but barely works

Sample string:

2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)

Desired output:

Match 1:

INFO:Starting.  (com.X.s.f.o.o)

Match 2:

SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:

SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:

SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:

SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)
nndhawan

2 Answers

There's no need to match each entire entry with a regex; you can match just the dates and use them to split the string into the desired log entries:

import re

sample="""2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)"""

def date_match(s):
    """Return True if the string starts with a date and time."""
    # Raw string so the backslashes reach the regex engine unchanged.
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    log = []
    for line in full_log.split("\n"):
        if date_match(line):  # this line starts a new entry
            if log:  # if an entry is already being collected...
                yield "\n".join(log)  # ... yield it ...
                log = []  # ... and start a fresh one.

        log.append(line)  # add the current line to the entry being collected

    yield "\n".join(log)  # yield the last entry (no later date follows to close it)

logs = list(yield_matches(sample))

for i, l in enumerate(logs):
    print("Match {}:\n{}\n".format(i + 1, l))

yield_matches adds each line to a list called log until it reaches a line that starts with a date. When it does, it yields the current log as one string and resets the list to empty.

Here's what the output looks like:

Match 1:
2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)

Match 2:
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)
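
The question also asks about dropping the timestamp so that repeated entries compare equal. One way to do that on top of the entries above is sketched here. This is only a sketch: the helper names strip_timestamp and unique_entries are made up for illustration, the timestamp shape is assumed from the sample, and the third entry in the example list is a hypothetical repeat that is not in the original log.

```python
import re

# Assumed timestamp shape, taken from the sample: "2018-03-06 11:36:40:048 "
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}\s*")

def strip_timestamp(entry):
    """Remove the leading timestamp so identical messages compare equal."""
    return TIMESTAMP.sub("", entry, count=1)

def unique_entries(entries):
    """Return timestamp-free entries, first occurrence only, in original order."""
    seen = set()
    unique = []
    for entry in entries:
        key = strip_timestamp(entry)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

example = [
    "2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)",
    "2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)",
    # Hypothetical repeat with a different timestamp, added for illustration only.
    "2018-03-06 11:36:47:001 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)",
]

for entry in unique_entries(example):
    print(entry)
```

In practice you would pass the list of entries produced by the splitting code instead of the hand-written example list. Multi-line entries work the same way, since only the timestamp at the very start of each entry is removed.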
Sean Breckenridge

I was able to figure it out after reading through the following pieces of information:

python: multiline regular expression

https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s08.html

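The following regular expression matches a log entry that starts with a date (^\d{4}-\d{2}-\d{2}), lazily consumes as little text as possible (.+?), and stops as soon as the positive lookahead (?=...) sees the next date. For it to match over a multi-line string in Python, it needs re.DOTALL so that . also matches newlines, and re.MULTILINE so that ^ matches at the start of every line. Note that, as written, it does not return the last entry, because no date follows it. With those flags, this matches over a multi-line string! :D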

^\d{4}-\d{2}-\d{2}.+?(?=\d{4}-\d{2}-\d{2})
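
For reference, here is a minimal sketch of how the pattern above might be used from Python. It is not exactly the pattern as posted: the flags, the extra ^ inside the lookahead (so a date in the middle of a line does not end an entry) and the |\Z alternative (so the final entry is kept) are additions, and the sample is a shortened copy of the one in the question.

```python
import re

# Shortened copy of the sample from the question.
sample = (
    "2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)\n"
    "2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)\n"
    "io.G.StatusRuntimeException: EXCEEDED\n"
    "    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)"
)

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
# The "|\Z" alternative is an addition so the final entry is not dropped.
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2}.+?(?=^\d{4}-\d{2}-\d{2}|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Each match ends just before the next date, so trim the trailing newline.
entries = [m.rstrip("\n") for m in ENTRY.findall(sample)]

for i, entry in enumerate(entries, start=1):
    print("Match {}:\n{}\n".format(i, entry))
```

Without re.DOTALL the .+? stops at the end of the first line, which is the single-line behaviour described in the question.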

The following regular expression does the same thing as @Sean Breckenridge's solution, but it also drops the unique part of each entry (the timestamp) that I am trying to get rid of. Very useful!

(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}).+?(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}|\Z)
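
A rough sketch of running that pattern in Python follows. It assumes re.DOTALL is set so that .+? can run across the stack trace lines, and it strips the surrounding whitespace afterwards; the variable names and the shortened sample are only for illustration. The lookbehind is accepted by Python's re module because the timestamp pattern has a fixed width.

```python
import re

# Shortened copy of the sample from the question.
sample = (
    "2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)\n"
    "2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)\n"
    "io.G.StatusRuntimeException: EXCEEDED\n"
    "    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)"
)

TS = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}"

# Lookbehind: the match starts right after a timestamp.
# Lookahead: the match stops before the next timestamp or at the end of the string.
BODY = re.compile(r"(?<=" + TS + r").+?(?=" + TS + r"|\Z)", re.DOTALL)

# Each raw match starts with the space after the timestamp and may end with
# a newline, so strip both ends.
bodies = [m.strip() for m in BODY.findall(sample)]

for i, body in enumerate(bodies, start=1):
    print("Match {}:\n\n{}\n".format(i, body))
```

Because the timestamps are never part of the matches, two entries that differ only in their timestamp come out as equal strings and can be compared directly.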
nndhawan