
I want to parse specific tcpdump patterns and use optional matches to account for some optional parts (regex101 demo):

10:14:48.983541 IP 10.242.136.232.34266 > 10.81.163.129.9200: Flags [S], seq 2294574211, win 29200, options [mss 1460,sackOK,TS val 22536912 ecr 0,nop,wscale 7], length 0
10:14:48.983541 IP 10.242.136.232 > 10.81.163.129.9200: fictional stuff
10:14:48.983541 IP 10.242.136.232 > 10.81.163.129: also fictional stuff

The general structure for the string is "something, IP address, optional port, the > sign, IP, optional port, colon, something", separated by whitespaces. My match pattern for that is

.+(?P<src_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<src_port>\d*))?.>.(?P<dst_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<dst_port>\d*))?:\.*

In the demo regex above, it seems that the match is done from the right (mostly correctly) but then something happens on the way to the left and the first octet of the IP (the first \d* in the pattern) is never matched. Why?

Note: the last two "tcpdump outputs" are technically incorrect, I wanted to show some variations around optional elements.

WoJ
  • Try a lazy `.+?` at the start - [`.+?(?P<src_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<src_port>\d*))?.>.(?P<dst_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<dst_port>\d*))?:\.*`](https://regex101.com/r/sNTX95/1). I think you want to match 1 or more digits in all cases, too, so you need to replace `\d*` with `\d+`. If the `>` is enclosed with whitespace, replace `.>.` with `\s*>\s*` – Wiktor Stribiżew Oct 04 '16 at 08:40
  • You can always switch to regex debugger in regex101 (PCRE only), to see what actually happens. – Sebastian Proske Oct 04 '16 at 08:43
  • @WiktorStribiżew: it works, thank you. Would you mind turning this into an answer so that I can accept it? (a good explanation of lazy vs. greedy, once one knows this exists, is at http://stackoverflow.com/q/2301285/903011) – WoJ Oct 04 '16 at 08:45
  • @WiktorStribiżew: as for the last part of your comment: if I know that I will have several digits (as you correctly guessed), does it matter if `\d*` or `\d+` is used? (I know the difference - I was just wondering why one would be better than the other when the number of matches is one or more) – WoJ Oct 04 '16 at 08:47
  • If you want to anchor the regex at the right side place an `$` at the end of the pattern. `$` means end of line. – hek2mgl Oct 04 '16 at 08:47
  • @SebastianProske: thanks, I did not know that part, it is indeed useful (also to understand how the matching works at all) – WoJ Oct 04 '16 at 08:49

1 Answer


I see several potential "bottlenecks" here, the main issue being the first greedy .+. This subpattern grabs the whole string first and then backtracks, giving characters back one at a time until the subsequent subpatterns can match. Thus, the digits are matched "from the right". Since \d* can also match an empty string, backtracking stops as soon as the last three dot-separated groups fit, and the first octet stays inside .+. Turning it into a lazy .+? makes the regex consume as little as possible and try the subsequent subpatterns first; only when they fail does the lazy .+? get "expanded" by one more character, so the digits are matched from the left.
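To see the difference outside regex101, here is a minimal Python sketch. It assumes Python's `re` module, and the variable names (`LINE`, `TAIL`, `greedy`, `lazy`) are only illustrative. It runs the pattern from the question against the third sample line, once with the greedy prefix and once with the lazy one:

```python
import re

# Third sample line from the question.
LINE = "10:14:48.983541 IP 10.242.136.232 > 10.81.163.129: also fictional stuff"

# Everything after the leading ".+" in the question's pattern, unchanged.
TAIL = (
    r"(?P<src_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<src_port>\d*))?"
    r".>."
    r"(?P<dst_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<dst_port>\d*))?:\.*"
)

greedy = re.match(r".+" + TAIL, LINE)   # greedy prefix: backtracks from the right
lazy = re.match(r".+?" + TAIL, LINE)    # lazy prefix: expands from the left

print(greedy.group("src_ip"))  # .242.136.232  (first octet swallowed by .+)
print(lazy.group("src_ip"))    # 10.242.136.232
```

With the greedy prefix, the first `\d*` matches an empty string, which is why the first octet appears to be missing.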

Another way to make it work is to specify the unique context before these digits, which is a space here. Just add a space after the greedy .+ and backtracking will stop at the last space that is followed by text matching the rest of the subpatterns. See this regex demo.
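A quick Python check of the space variant could look like the sketch below. It assumes the `re` module, keeps the original `\d*` quantifiers on purpose, and uses illustrative names (`LINE`, `PATTERN`, `m`):

```python
import re

# Second sample line from the question.
LINE = "10:14:48.983541 IP 10.242.136.232 > 10.81.163.129.9200: fictional stuff"

# Greedy ".+" followed by a literal space: the source IP must start right after a space.
PATTERN = (
    r".+ (?P<src_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<src_port>\d*))?"
    r".>."
    r"(?P<dst_ip>\d*\.\d*\.\d*\.\d*)(?:\.(?P<dst_port>\d*))?:"
)

m = re.match(PATTERN, LINE)
print(m.group("src_ip"), m.group("src_port"))  # 10.242.136.232 None
print(m.group("dst_ip"), m.group("dst_port"))  # 10.81.163.129 9200
```

The space acts as an anchor: no space further to the right is followed by something that satisfies the rest of the pattern, so the first octet is no longer lost.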

Also, the last \.* is not necessary, you may remove it. You seem to want to match 1 or more digits in all cases, so, you may replace all \d* with \d+. If the > is enclosed with whitespace, replace .>. with \s*>\s*.
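As for why \d+ is preferable even when the input always has digits: \d* also accepts empty octets, which is what lets a greedy prefix eat the first one. A small sketch (Python `re`; the names `loose` and `strict` are just for illustration):

```python
import re

loose = re.compile(r"\d*\.\d*\.\d*\.\d*")   # each octet may be empty
strict = re.compile(r"\d+\.\d+\.\d+\.\d+")  # each octet needs at least one digit

for text in ("10.242.136.232", ".242.136.232", "..."):
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
# 10.242.136.232 True True
# .242.136.232 True False
# ... True False
```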

I suggest:

.+?(?P<src_ip>\d+\.\d+\.\d+\.\d+)(?:\.(?P<src_port>\d+))?\s*>\s*(?P<dst_ip>\d+\.\d+\.\d+\.\d+)(?:\.(?P<dst_port>\d+))?:

or a slightly shorter version with limiting quantifiers:

.+?(?P<src_ip>\d+(?:\.\d+){3})(?:\.(?P<src_port>\d+))?\s*>\s*(?P<dst_ip>\d+(?:\.\d+){3})(?:\.(?P<dst_port>\d+))?:

See this regex demo
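For completeness, a sketch of how the contracted pattern might be used from Python. The helper name `parse_line` and the constant `TCPDUMP_RE` are hypothetical, and the three inputs are the sample lines from the question:

```python
import re

# Contracted pattern from above, split across lines for readability.
TCPDUMP_RE = re.compile(
    r".+?(?P<src_ip>\d+(?:\.\d+){3})(?:\.(?P<src_port>\d+))?"
    r"\s*>\s*"
    r"(?P<dst_ip>\d+(?:\.\d+){3})(?:\.(?P<dst_port>\d+))?:"
)

def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = TCPDUMP_RE.match(line)
    return m.groupdict() if m else None

LINES = [
    "10:14:48.983541 IP 10.242.136.232.34266 > 10.81.163.129.9200: Flags [S], "
    "seq 2294574211, win 29200, options [mss 1460,sackOK,TS val 22536912 ecr 0,"
    "nop,wscale 7], length 0",
    "10:14:48.983541 IP 10.242.136.232 > 10.81.163.129.9200: fictional stuff",
    "10:14:48.983541 IP 10.242.136.232 > 10.81.163.129: also fictional stuff",
]

for line in LINES:
    print(parse_line(line))
```

Missing optional ports come back as None in the resulting dict, so the caller can tell "no port" apart from an empty string.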

Wiktor Stribiżew