5

I'm trying to parse log entries in a C# app using this regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\|){3})(?!\1) for logs in a format like [date (in some format)] | [level] | [appname] | [message].

Where (I think):

  • ^ matches the beginning of a line (with the /gm flags enabled on regex101)
  • [0-9]{4}(-[0-9]{2}){2} matches the start of the date, e.g. 2015-03-03
  • ([^|]+\|){3} matches the rest of the date, the log level and the app name
  • (?!\1) asserts that what follows is not the start of a new log entry (so it should be the message)

For example, I have the following 4 log entries (separated by a blank line for clarity):

2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.

2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.

2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.

2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

But the regex does not capture the message when I test it on regex101, probably because I don't understand how to capture the negative lookahead.

If I include .* in the regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\|){3}).*(?!\1) it matches the message but only a single line (because . does not match a newline).
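The two effects described above can be reproduced in isolation. This is a minimal sketch in Python's `re` module rather than C# (chosen only so the snippet is easy to run; the sample string is made up): a lookahead asserts without consuming anything, so it never adds text to the match, and `.` stops at a newline unless single-line (DOTALL) mode is on.

```python
import re

# Made-up sample: a header fragment followed by a two-line message.
text = "MyApp|first line\nsecond line"

# A lookahead only asserts what comes next; it consumes no characters,
# so the overall match ends right before the asserted position.
header_only = re.search(r"MyApp\|(?!\d)", text).group(0)

# Without DOTALL, '.' stops at the first newline.
first_line = re.search(r"MyApp\|(.*)", text).group(1)

# With DOTALL (the 's' flag on regex101), '.' also matches newlines.
whole_message = re.search(r"MyApp\|(.*)", text, re.DOTALL).group(1)

print(repr(header_only))    # 'MyApp|'
print(repr(first_line))     # 'first line'
print(repr(whole_message))  # 'first line\nsecond line'
```

In .NET the corresponding option is `RegexOptions.Singleline`.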

So how can I capture the (multiline) message?

Kapé
  • What language are you using here? There are several different regex flavors depending on the environment, so please be specific. – phillip Mar 04 '15 at 20:21
  • @phillip I want to use it in a C# app, but I first tried to make it work using the default PHP flavor of regex101. – Kapé Mar 04 '15 at 21:33

3 Answers

3

You can use this regex:

(^\d{4}(-\d{2}){2}([^|]+\|){3})([\s\S]*?)\n*(?=^\d{4}.*?(?:[^|\n]+\|){3}|\z)

RegEx Demo

This regex should work in C# as well; just make sure to use the Multiline option (RegexOptions.Multiline).
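As a rough illustration of applying the pattern with the multiline flag, here is a sketch in Python rather than C#. Two things are assumptions of the sketch and not part of the answer: `\z` is written as `\Z` (Python's spelling of "end of string"), and the two-entry log is a shortened, made-up version of the question's sample.

```python
import re

# The answer's pattern, with \z swapped for Python's \Z (end of string).
ENTRY = re.compile(
    r"(^\d{4}(-\d{2}){2}([^|]+\|){3})([\s\S]*?)\n*"
    r"(?=^\d{4}.*?(?:[^|\n]+\|){3}|\Z)",
    re.MULTILINE,  # lets ^ match at the start of every line
)

# Shortened sample: the second message contains a bare date line.
log = (
    "2015-03-03 19:30:47.2725|INFO|MyApp|one line\n"
    "2015-03-03 19:31:29.1209|INFO|MyApp|two lines with\n"
    "2015-03-03\n"
    "a date in it."
)

# Group 4 is the lazily matched message text.
messages = [m.group(4) for m in ENTRY.finditer(log)]
print(messages)
```

The lazy `[\s\S]*?` keeps growing until the lookahead sees either a full header (a date followed by three `|`-terminated fields on the same line) or the end of the input, which is why the bare date line stays inside the second message.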

anubhava
  • @anubhava Thanks for your answer, but I have the same issue that I just commented on under [Necreaux's answer](http://stackoverflow.com/a/28864065/465942) – Kapé Mar 04 '15 at 22:21
  • I did read but don't understand for which situation it doesn't work. Can you update the demo link and provide me an updated regex101 link showing what doesn't work. – anubhava Mar 04 '15 at 22:27
  • [Even with your updated sample data it works](https://regex101.com/r/eP0eD5/2) since lookahead is `^\d{4}` i.e. year part at **start of next line** – anubhava Mar 04 '15 at 22:29
  • @anubhava Could you [check this link](https://regex101.com/r/eP0eD5/3)? The last message should be valid (that's at least what I want) but is not. – Kapé Mar 04 '15 at 22:35
  • In the last message you have missing `MyApp|` which is your App name. Can that really happen in application log? – anubhava Mar 04 '15 at 22:40
  • If it can, then there is no regex in the world that will fit. In that case the logging should be split into multiple logs so that this would work, but the logger should still be able to force the output to include all fields. – phillip Mar 04 '15 at 22:45
  • @phillip No, that's why that should not be part of a new log message. I count 4, not 5. Besides this very unlikely entry, the 2nd log message, which has only a date in the log text, is not captured by the regex either. – Kapé Mar 05 '15 at 09:14
  • @KP_: You didn't respond to the question I asked. **In the last message you have missing `MyApp|` which is your App name. Can that really happen in application log?** – anubhava Mar 05 '15 at 09:42
  • @anubhava Yes I did.. What you see as last log entry is not a log entry, so that is part of the message. I've separated the example log messages with a new line to clarify what a single log message is. As I said, this is not very likely to happen. – Kapé Mar 05 '15 at 10:00
  • @anubhava Thanks for the updated regex, but it has 5 matches, it should have 4. – Kapé Mar 05 '15 at 10:03
  • But earlier regex already had 4 and you said **last message should be valid** – anubhava Mar 05 '15 at 10:07
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72312/discussion-between-kp-and-anubhava). – Kapé Mar 05 '15 at 10:10
3

Something like this should work.
See the comments in the regex.
(mod: make line break optional for EOS or single line message)

 @"(?m)^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}((?:(?!^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}).*(?:\r?\n)?)+)"

Formatted (with this):

 (?m)                          # Modifier - multiline
 ^                             # BOL
 [0-9]{4}                      # Message header
 (?: - [0-9]{2} ){2}
 (?: [^|\r\n]+ \| ){3}
 (                             # (1 start), The Message
      (?:
           (?!                           # Assert, not a Message header
                ^                             # BOL
                [0-9]{4} 
                (?: - [0-9]{2} ){2}
                (?: [^|\r\n]+ \| ){3}
           )
           .*                            # Line is ok, it's part of the message
           (?: \r? \n )?                 # Optional line break
      )+
 )                             # (1 end)

Output:

 **  Grp 0 -  ( pos 0 , len 74 ) 
2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.


 **  Grp 1 -  ( pos 36 , len 38 ) 
This is a single line log message.

--------------

 **  Grp 0 -  ( pos 74 , len 108 ) 
2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.


 **  Grp 1 -  ( pos 110 , len 72 ) 
This log message has multiple
lines with
2015-03-03
a date in it.

--------------

 **  Grp 0 -  ( pos 182 , len 97 ) 
2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.


 **  Grp 1 -  ( pos 218 , len 61 ) 
This log message has
multiple lines
but just text only.

--------------

 **  Grp 0 -  ( pos 279 , len 186 ) 
2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

 **  Grp 1 -  ( pos 316 , len 149 ) 
This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.
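
For reference, the same pattern can be exercised outside of C#. The following is a sketch in Python's `re` module (an assumption made only so it is easy to run; the regex itself is unchanged apart from passing the multiline flag as an argument instead of `(?m)`). The names `HEADER` and `ENTRY` are made up for the sketch, and the log text is the question's four sample entries without the blank separator lines.

```python
import re

# The header sub-pattern appears twice in the regex, so build it once.
HEADER = r"^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}"

# Header, then one or more lines that do not themselves start with a header.
ENTRY = re.compile(
    HEADER + r"((?:(?!" + HEADER + r").*(?:\r?\n)?)+)",
    re.MULTILINE,
)

log = (
    "2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.\n"
    "2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple\n"
    "lines with\n"
    "2015-03-03\n"
    "a date in it.\n"
    "2015-03-03 19:32:50.1106|INFO|MyApp|This log message has\n"
    "multiple lines\n"
    "but just text only.\n"
    "2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but\n"
    "also some confusing text like\n"
    "2015-03-03 19:33:20.2683|ERROR| which should\n"
    "still be a valid log message."
)

# Group 1 is the message; strip the trailing line break it may include.
messages = [m.group(1).rstrip("\r\n") for m in ENTRY.finditer(log)]
print(len(messages))  # 4
```

Building the regex from a shared header string avoids typing the header twice while keeping the lookahead a real sub-pattern rather than a backreference.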
-2

What regex engine are you using? In Java for example there is a flag to tell "." to match newline characters.

The following regex appears to do the trick:

/(([0-9]{4})(-[0-9]{2}){2}([^|]+\|){3})((.(?!\2))*)/sg

The changes I made to your regex were mostly cleanup (your date capturing group was wrong). I then added a . and a * to the final capturing group. https://regex101.com/r/fU1vV1/2

The most important part is the use of the sg flags. g makes it return all matches. s makes . match newline characters as well, so the input is treated like a single line (otherwise the match would stop at the end of the first line). All of this would be unnecessary if you could guarantee that each message is on one line, since you could then just capture to the end of the line.
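
One caveat, raised in the comments on this answer: `\2` is a backreference, so the lookahead compares against the literal text captured by group 2 (here the year) and does not re-apply the date sub-pattern. The sketch below uses Python's `re` module, with `re.DOTALL` standing in for the s flag and `finditer` for g; both log snippets are made up to show the two failure modes.

```python
import re

# The answer's pattern; group 5 is the message.
PATTERN = re.compile(
    r"(([0-9]{4})(-[0-9]{2}){2}([^|]+\|){3})((.(?!\2))*)",
    re.DOTALL,
)

# Made-up input 1: the message text itself mentions the year.
same_year = (
    "2015-03-03 19:31:29.1209|INFO|MyApp|Released in 2015, still running\n"
    "2015-03-03 19:32:50.1106|INFO|MyApp|ok"
)

# Made-up input 2: the next entry starts in a different year.
new_year = (
    "2015-12-31 23:59:59.0000|INFO|MyApp|last of the year\n"
    "2016-01-01 00:00:00.0000|INFO|MyApp|first of the year"
)

# The first message is cut off just before the literal text "2015".
truncated = [m.group(5) for m in PATTERN.finditer(same_year)]

# "2016" never equals the captured "2015", so the first message
# swallows the whole second entry.
merged = [m.group(5) for m in PATTERN.finditer(new_year)]

print(truncated)
print(merged)
```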

Necreaux
  • @Necreaux It looks like your answer does work! I see that you use the `s` (single line) modifier instead. Could you please explain the regex step by step here, just like I did, so that I and others can learn from this? (Instead of a link which might be broken someday) – Kapé Mar 04 '15 at 21:52
  • Further cleanup/clarification added. – Necreaux Mar 04 '15 at 22:04
  • @Necreaux I now see that when I have a log message which contains a date itself (or just a year), the message is no longer completely included in the match. A log message should be everything up to the start of a new log message, which can be detected by the log format I described. Any suggestions? – Kapé Mar 04 '15 at 22:18
  • @Necreaux: You're making the same mistake as the OP, using a backreference (`\1` or `\2`) as a subroutine. A backreference doesn't try to match the same subexpression again, it tries to match the actual substring that was captured in that group. The reason your lookahead seems to work is because you're only comparing the first four characters, which all happen to be the same. Also, you left out the start-of-line anchor (`^` in multiline mode). – Alan Moore Mar 05 '15 at 02:36
  • @AlanMoore I totally agree about the backreference stuff. Maybe I should have approached this differently, but my approach was to try to point the OP in the correct direction by answering the specific question, rather than going off on a tangent to fix other issues in the regex. – Necreaux Mar 05 '15 at 13:13
  • @Necreaux: The regex I came up with was so similar to [@sln's](http://stackoverflow.com/a/28867622/20938), there was no point posting it. I was hoping to condense it some, but the additional details provided by the OP make it certain: you just have to write most of the regex twice. – Alan Moore Mar 05 '15 at 13:50
  • And I agree with you about tangents, but if you spot an outright error and don't point it out, you're effectively endorsing the incorrect behavior, making it that much harder for them to unlearn it later. And if you *fail* to notice the error (especially if you copy or repeat it in your answer) you undermine your own credibility. – Alan Moore Mar 05 '15 at 14:10