We are looking to dump our PMDF logs into Splunk, and I am trying to parse the PMDF SMTP logs, specifically the message. I'm hitting an issue where a named capturing group (dst_channel) may or may not have a value. Here is my regex so far:

\d{2}\-\w{3}\-\d{4}\s\d{2}\:\d{2}\:\d{2}\.\d{2}\s(?P<src_channel>\w+)\s+(?P<dst_channel>\w+)\s(?P<code>\w+)\s(?P<bytes>\d+)\s(?P<from>\w.+)\srfc822

I'm able to match the following message, in which tcp_msx_out_2 is the dst_channel:

02-Feb-2017 08:00:19.60 tcp_exempt   tcp_msx_out_2 E 2 mailman-bounces@list.xyz.com rfc822;user@xyz.com user@xyz.com <mailman.157.1486040414.29131.xxx@xxx.xyz.com> pmdf list.xyz.com ([x.x.x.x])

However, I'm not matching the following log line, which doesn't contain a dst_channel value:

02-Feb-2017 09:00:01.59 tcp_imap_int              Q 12 xxx@xyz.com rfc822;user@imap-internal.xyz.com user@imap.xyz.com <6940401380880269855036@PT-D69> pmdf  user@imap.xyz.com: smtp;452 4.2.2 Over quota

The next named capturing group I have is code (E in the first message example, and Q in the second), and when the dst_channel is not there, the regex does not capture the code.

How can I modify my regex for conditional statements so that if the dst_channel is there, it grabs the value, but if not, regex continues on and is able to consistently grab the values for the other named capturing groups I have?

2 Answers

It worked when I changed the \w+ in the dst_channel group to \w*, so that the group is allowed to match an empty string:

\d{2}\-\w{3}\-\d{4}\s\d{2}\:\d{2}\:\d{2}\.\d{2}\s(?P<src_channel>\w+)\s+(?P<dst_channel>\w*)\s(?P<code>\w+)\s(?P<bytes>\d+)\s(?P<from>\w.+)\srfc822

You can test it here
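
Separately from the online tester, a minimal Python sketch of the same check is below. Python's `re` module accepts the `(?P<name>...)` syntax used in the question; the helper name `parse_with_star` is made up for this illustration, and the two sample lines are the ones from the question. Note that with `\w*` the dst_channel group still participates in the match, so a missing channel comes back as an empty string.

```python
import re

# Pattern from this answer: dst_channel uses \w* so it may match nothing.
STAR_PATTERN = re.compile(
    r"\d{2}\-\w{3}\-\d{4}\s\d{2}\:\d{2}\:\d{2}\.\d{2}\s"
    r"(?P<src_channel>\w+)\s+(?P<dst_channel>\w*)\s"
    r"(?P<code>\w+)\s(?P<bytes>\d+)\s(?P<from>\w.+)\srfc822"
)

# Sample lines copied from the question.
WITH_DST = (
    "02-Feb-2017 08:00:19.60 tcp_exempt   tcp_msx_out_2 E 2 "
    "mailman-bounces@list.xyz.com rfc822;user@xyz.com user@xyz.com "
    "<mailman.157.1486040414.29131.xxx@xxx.xyz.com> pmdf list.xyz.com ([x.x.x.x])"
)
WITHOUT_DST = (
    "02-Feb-2017 09:00:01.59 tcp_imap_int              Q 12 xxx@xyz.com "
    "rfc822;user@imap-internal.xyz.com user@imap.xyz.com "
    "<6940401380880269855036@PT-D69> pmdf  user@imap.xyz.com: smtp;452 4.2.2 Over quota"
)


def parse_with_star(line):
    """Hypothetical helper: return the named groups as a dict, or None if no match."""
    match = STAR_PATTERN.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for sample in (WITH_DST, WITHOUT_DST):
        print(parse_with_star(sample))
```

The second line only matches after the engine backtracks: `\s+` gives back one space so that the single `\s` following the empty dst_channel has something to consume.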

maraaaaaaaa

I suggest you use

\d{2}-\w{3}-\d{4}\s+\d{2}:\d{2}:\d{2}\.\d{2}\s+(?P<src_channel>\w+)(?:\s+(?P<dst_channel>\w+))?\s+(?P<code>\w+)\s+(?P<bytes>\d+)\s+(?P<from>\S+)\s+rfc822
                                                                   ^^^                       ^^  

See the regex demo.

Basically, replace all \s with \s+ and make the dst channel group optional by wrapping both the \s+ and the whole dst channel group with an optional non-capturing group.

Also, the from group pattern should be replaced with \S+ (one or more chars other than whitespace) because you want to match an email, and .+ may - and usually it does - overmatch.
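
To see the optional group in action outside of a regex tester, here is a small Python sketch run against the two sample lines from the question. The function name `parse_pmdf` is only an illustrative choice, and Python's `re` stands in for whatever engine Splunk uses, so treat it as a rough check rather than a Splunk configuration.

```python
import re

# Pattern from this answer: the whitespace and dst_channel are wrapped
# together in an optional non-capturing group, (?:\s+(?P<dst_channel>\w+))?
OPTIONAL_PATTERN = re.compile(
    r"\d{2}-\w{3}-\d{4}\s+\d{2}:\d{2}:\d{2}\.\d{2}\s+"
    r"(?P<src_channel>\w+)(?:\s+(?P<dst_channel>\w+))?\s+"
    r"(?P<code>\w+)\s+(?P<bytes>\d+)\s+(?P<from>\S+)\s+rfc822"
)

# Sample lines copied from the question.
WITH_DST = (
    "02-Feb-2017 08:00:19.60 tcp_exempt   tcp_msx_out_2 E 2 "
    "mailman-bounces@list.xyz.com rfc822;user@xyz.com user@xyz.com "
    "<mailman.157.1486040414.29131.xxx@xxx.xyz.com> pmdf list.xyz.com ([x.x.x.x])"
)
WITHOUT_DST = (
    "02-Feb-2017 09:00:01.59 tcp_imap_int              Q 12 xxx@xyz.com "
    "rfc822;user@imap-internal.xyz.com user@imap.xyz.com "
    "<6940401380880269855036@PT-D69> pmdf  user@imap.xyz.com: smtp;452 4.2.2 Over quota"
)


def parse_pmdf(line):
    """Illustrative helper: return the named groups as a dict, or None if no match."""
    match = OPTIONAL_PATTERN.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for sample in (WITH_DST, WITHOUT_DST):
        print(parse_pmdf(sample))
```

When the channel is absent, the optional group is skipped entirely, so in Python `dst_channel` comes back as `None` rather than an empty string.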

Wiktor Stribiżew
  • This is the answer, this regex is way more efficient than mine, less than half the steps – maraaaaaaaa Feb 02 '17 at 16:52
  • Wiktor, thx a million for the reply and detailed regex as it's greatly appreciated. Is it safe to say any time I want to signify a non-capturing group I need to wrap both the \s+ and the captured naming group: (?:\s+(?P<dst_channel>\w+))? – user3723206 Feb 02 '17 at 17:44
  • An *optional* capturing group, `(?:...)?`. It does not *always* work (it depends on the patterns around that group), but usually works well, especially after we got rid of all `.*`-like patterns. – Wiktor Stribiżew Feb 02 '17 at 17:45
  • One last question - do I handle the tailing end of the message with a non-capturing group as well? The end of message one is: smtp;452 4.2.2 Over quota The end of message two is: ([1.1.1.1]) I want to be able to capture the source IP if it's listed Thx – user3723206 Feb 02 '17 at 18:09
  • Thx again for the clarification – user3723206 Feb 02 '17 at 18:18
  • @wiktor - last question, promise! I have a named capturing group for sending_domain as such - (?P<sending_domain>\w.+), but I'm having trouble handling the different values of user@imap.xyz.com: and user@xyz.com (space) How do I have regex capture the sending domain no matter what characters are included and how end (: vs space vs whatever follows)? Thx – user3723206 Feb 02 '17 at 18:27
  • No idea what you mean. Maybe [something like this](https://regex101.com/r/mGBTA4/3). – Wiktor Stribiżew Feb 02 '17 at 20:12
  • Apologize for the lack of clarity - for the sending domain, sometimes I see email addresses conforming to the standard format of user@xyz.com|edu|net (etc)., but sometimes I'll see other formats, such as user@.xyz.com which may or may not contain special characters - user@internal-imap.xyz.com. I'm trying to end the naming captured group for domain after the email address concludes with |edu|net (etc), and not pick up the special characters (so far I've only seen a colon) before applying the naming captured group for the IP. – user3723206 Feb 02 '17 at 21:06
  • Add a `\b` at the end of the group. – Wiktor Stribiżew Feb 02 '17 at 21:07
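
A rough Python sketch of what the `\b` suggestion does is below. The comment does not say which pattern sits inside the group, so `\S+` is an assumption here, as is the helper name `sending_domain`; the inputs are the two address forms quoted in the comments above.

```python
import re

# Assumed group body: \S+ followed by \b, so the capture must end on a
# word character and a trailing ":" is left out of the group.
DOMAIN_PATTERN = re.compile(r"(?P<sending_domain>\S+\b)")


def sending_domain(fragment):
    """Hypothetical helper: return the first captured value, or None."""
    match = DOMAIN_PATTERN.search(fragment)
    return match.group("sending_domain") if match else None


if __name__ == "__main__":
    # Address followed by a colon, then address followed by a space.
    print(sending_domain("user@imap.xyz.com: smtp;452 4.2.2 Over quota"))
    print(sending_domain("user@xyz.com <mailman.157.1486040414.29131.xxx@xxx.xyz.com>"))
```

In the colon case `\S+` first swallows the colon, finds no word boundary between `:` and the space, and backtracks one character so the capture ends at the final letter of the domain.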