
I have some data that looks like this:

DEC 12, 2020
incoming 192.168.0.5 10:30
outgoing 192.168.0.5 13:23
DEC 13, 2020
incoming 192.168.0.6 09:34
outgoing 192.168.0.6 14:12

I am trying to get the date and all data for that date into one grouping like so:

First match
Group 1 - DEC 12, 2020
Group 2 - incoming 192.168.0.5 10:30
          outgoing 192.168.0.5 13:23

Second match
Group 1 - DEC 13, 2020
Group 2 - incoming 192.168.0.6 09:34
          outgoing 192.168.0.6 14:12

I have tried this regex:

^([A-Z] \d+, \d{4})(.*)

The problem is, this reads all the way to the end instead of stopping at the next match (DEC 13, 2020) like so:

Group 1 - DEC 12, 2020
Group 2 - incoming 192.168.0.5 10:30
          outgoing 192.168.0.5 13:23
          DEC 13, 2020
          incoming 192.168.0.6 09:34
          outgoing 192.168.0.6 14:12

If I add the ? like so:

^([A-Z] \d+, \d{4})(.*?)

Then I get only the dates.

First Match
Group 1 - DEC 12, 2020
Group 2 - white space

Second Match
Group 1 - DEC 13, 2020
Group 2 - white space

Can someone please tell me what I am missing? How can I get it to stop at the next match and not the end of the line or end of the text? All lines have a CRLF at the end. Thanks.

J_K_M_A_N
  • Why don't you simply split by newlines in whatever language you use. Every *not MOD 3* will be your 1st group, the rest is your second group. – Roko C. Buljan Jan 07 '21 at 16:28
  • `^([A-Z]{3} \d+, \d{4})((?:\n(?![A-Z]{3} \d).*)*)`? See https://regex101.com/r/mZy8no/1 – Wiktor Stribiżew Jan 07 '21 at 16:29
  • Sorry. I should have mentioned that most days will have a different amount of entries. I simplified it above and used 2 lines for both. I should not have done that. – J_K_M_A_N Jan 07 '21 at 16:45
  • Wiktor, I test my regex at http://regexstorm.net/tester because it seems to line up with VB.NET for me 99% of the time. Your regex gave me the same white space and all the dates. :( – J_K_M_A_N Jan 07 '21 at 17:02
  • Use `(?m)^([A-Z]{3} \d+, \d{4})((?:\r?\n(?![A-Z]{3} \d).*)*)`, see [demo](http://regexstorm.net/tester?p=%28%3fm%29%5e%28%5bA-Z%5d%7b3%7d+%5cd%2b%2c+%5cd%7b4%7d%29%28%28%3f%3a%5cr%3f%5cn%28%3f!%5bA-Z%5d%7b3%7d+%5cd%29.*%29*%29&i=DEC+12%2c+2020%0d%0aincoming+192.168.0.5+10%3a30%0d%0aoutgoing+192.168.0.5+13%3a23%0d%0aDEC+13%2c+2020%0d%0aincoming+192.168.0.6+09%3a34%0d%0aoutgoing+192.168.0.6+14%3a12). – Wiktor Stribiżew Jan 07 '21 at 17:03
  • Nailed it! Thanks Wiktor. Do you want to post a formal answer so I can accept it? Thank you for the help! – J_K_M_A_N Jan 07 '21 at 17:07
  • Roko, I definitely could have done that (I use VB.net) but I really like regex and I want to expand my knowledge on that. That is why I wanted to learn this way. Thank you for the suggestion though. – J_K_M_A_N Jan 07 '21 at 17:34
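For comparison, the line-by-line approach suggested in the comments can be sketched as follows. This is only an illustration in Python rather than the asker's VB.NET, and the names `DATE_LINE` and `group_lines` are made up for the example. Instead of counting lines, it starts a new group at each date line, so days with a different number of entries are handled too.

```python
import re

# Hypothetical helper: recognises a date header such as "DEC 12, 2020".
DATE_LINE = re.compile(r"^[A-Z]{3} \d+, \d{4}$")


def group_lines(text):
    """Return a list of (date, [entry lines]) tuples, reading line by line."""
    groups = []
    for line in text.splitlines():  # splitlines() handles CRLF and LF
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if DATE_LINE.match(line):
            groups.append((line, []))  # a date line opens a new group
        elif groups:
            groups[-1][1].append(line)  # any other line joins the current group
    return groups


if __name__ == "__main__":
    sample = (
        "DEC 12, 2020\r\n"
        "incoming 192.168.0.5 10:30\r\n"
        "DEC 13, 2020\r\n"
        "incoming 192.168.0.6 09:34\r\n"
        "outgoing 192.168.0.6 14:12\r\n"
    )
    for date, entries in group_lines(sample):
        print(date, entries)
```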

1 Answer

You can use

(?m)^([A-Z]{3} \d+, \d{4})((?:\r?\n(?![A-Z]{3} \d).*)*)

See the regex demo. Details:

  • (?m) - a RegexOptions.Multiline inline option
  • ^ - start of a line
  • ([A-Z]{3} \d+, \d{4}) - Group 1: three uppercase ASCII letters, space, one or more digits, a comma, a space and then four digits
  • ((?:\r?\n(?![A-Z]{3} \d).*)*) - Group 2: zero or more occurrences of
    • \r?\n - a CRLF or LF only line break sequence...
    • (?![A-Z]{3} \d) - that is not immediately followed with three uppercase ASCII letters, space, digit
    • .* - the rest of the line.
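The breakdown above can be tried out with a short script. The sketch below uses Python's `re` module instead of .NET (an assumption made for illustration; this pattern uses only syntax both engines share), and `SAMPLE` and `group_by_date` are hypothetical names. Because `.` also matches `\r`, Group 2 keeps the leading line break and stray carriage returns, so the sketch strips them before splitting.

```python
import re

# Pattern copied from the answer; (?m) lets ^ match at the start of every line.
PATTERN = re.compile(r"(?m)^([A-Z]{3} \d+, \d{4})((?:\r?\n(?![A-Z]{3} \d).*)*)")

# Sample data from the question, with CRLF line endings as the asker describes.
SAMPLE = (
    "DEC 12, 2020\r\n"
    "incoming 192.168.0.5 10:30\r\n"
    "outgoing 192.168.0.5 13:23\r\n"
    "DEC 13, 2020\r\n"
    "incoming 192.168.0.6 09:34\r\n"
    "outgoing 192.168.0.6 14:12\r\n"
)


def group_by_date(text):
    """Return a list of (date, [entry lines]) tuples, one per match."""
    result = []
    for m in PATTERN.finditer(text):
        # Group 2 starts with a line break and may end with "\r" or "\r\n",
        # so trim it before splitting into individual entries.
        entries = m.group(2).strip().splitlines()
        result.append((m.group(1), entries))
    return result


if __name__ == "__main__":
    for date, entries in group_by_date(SAMPLE):
        print(date, entries)
```

A date line with no entries under it still matches; its Group 2 is simply empty.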


Wiktor Stribiżew
  • Thank you very much. I did not know about nesting the \r?\n like that. That is why I like to post on this site. So I can learn something new and have reference to it. :) Thanks again! – J_K_M_A_N Jan 07 '21 at 17:26
  • @J_K_M_A_N `(?:...)` is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions) that is used to *group* sequences of patterns, so that they could be quantified together as a single group. The capturing parentheses around this `(?:...)*` have a side-effect of keeping the initial line break in the Group 2 value, so you should `.Trim()` the value or regroup the patterns repeating them: `(?m)^([A-Z]{3} \d+, \d{4})\r?\n((?![A-Z]{3} \d).*(?:\r?\n(?![A-Z]{3} \d).*)*)` – Wiktor Stribiżew Jan 07 '21 at 17:35
  • Thanks. I usually use a .Replace(vbCr,"") when reading these groups. (Or vbCrLf) Roko is right that it would be much easier to read line by line, but again, I like regex. :) – J_K_M_A_N Jan 07 '21 at 17:41
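The regrouped pattern from the comments can be checked the same way. The following is a rough sketch in Python rather than VB.NET, with hypothetical names (`REGROUPED`, `SAMPLE`, `split_groups`). It shows that Group 2 no longer begins with a line break. With CRLF input a trailing `\r` can still remain, because `.` matches `\r`. Note also that this variant requires at least one non-date line after each date, so a date with no entries is not matched.

```python
import re

# Regrouped pattern from the comment: the first line break sits outside Group 2.
REGROUPED = re.compile(
    r"(?m)^([A-Z]{3} \d+, \d{4})\r?\n"
    r"((?![A-Z]{3} \d).*(?:\r?\n(?![A-Z]{3} \d).*)*)"
)

# Sample data from the question, with CRLF line endings.
SAMPLE = (
    "DEC 12, 2020\r\n"
    "incoming 192.168.0.5 10:30\r\n"
    "outgoing 192.168.0.5 13:23\r\n"
    "DEC 13, 2020\r\n"
    "incoming 192.168.0.6 09:34\r\n"
    "outgoing 192.168.0.6 14:12\r\n"
)


def split_groups(text):
    """Return a list of (date, [entry lines]) tuples using the regrouped pattern."""
    result = []
    for m in REGROUPED.finditer(text):
        # splitlines() drops the CR/LF characters, including a trailing "\r".
        result.append((m.group(1), m.group(2).splitlines()))
    return result


if __name__ == "__main__":
    for m in REGROUPED.finditer(SAMPLE):
        # repr() makes any leftover "\r" visible in the raw Group 2 value.
        print(m.group(1), repr(m.group(2)))
    print(split_groups(SAMPLE))
```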