(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)+
Give it a whirl
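For a quick local check in PowerShell, something like the following should do. It is only a sketch: the sample lines are borrowed from the example at the end of this answer, and the variable names ($pattern, $log, $blocks) are my own.

```powershell
# The pattern from above, single-quoted so PowerShell leaves the backslashes alone.
$pattern = '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)+'

# Sample log, joined with bare LF newlines (the pattern does not account for CR).
$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - not starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [3] - Starting',
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting'
) -join "`n"

$blocks = [regex]::Matches($log, $pattern)

# Number of duplicate blocks found.
$blocks.Count

# Each block, with the leading newline consumed by (?:^|\n) trimmed off.
$blocks | ForEach-Object { $_.Value.Trim() }
```

On this sample I'd expect a single block covering the two consecutive 05:49 "Starting" lines ([2] and [3]); the first line is skipped because the line right after it is not a "Starting" line.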
Note:
- this doesn't tolerate deviations in spacing (you'd just need to replace each literal space with \s+)
- will only match blocks of duplicates, not each duplicate individually (one match in your example, encompassing the two lines)
- duplicates will only be recognised if they are sequential (this is to keep the regex efficient)
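The spacing note can be sketched in PowerShell roughly like this. The names ($strict, $loose, $sample) and the double-spaced sample lines are made up for illustration. One assumption worth flagging: because the backreferences \1, \2 and \3 repeat the captured text exactly, the duplicate lines still have to share the same spacing with each other.

```powershell
# Strict pattern, exactly as given above.
$strict = '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)+'

# Literal (non-regex) string replacement: every space becomes \s+.
$loose = $strict.Replace(' ', '\s+')

# Hypothetical log lines that use two spaces wherever the strict pattern expects one.
$sample = @(
    '2019-10-31  05:49:53.416  +10:00  [2]  -  Starting',
    '2019-10-31  05:49:54.001  +10:00  [3]  -  Starting'
) -join "`n"

# Compare how many blocks each variant finds.
[regex]::Matches($sample, $strict).Count
[regex]::Matches($sample, $loose).Count
```

The strict pattern should find nothing here, while the loosened one should find the two lines as a single block. Note that \s also matches newlines; if that worries you, [ \t]+ is a narrower replacement.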
I disagree with @ggorlen's assessment: regex will be literally the fastest thing you can do for a problem that requires this amount of expressive power.
If, however, you need to match "Starting" lines that aren't sequential, but you can guarantee that the lines are in order (true of basically all logs, so "Starting" and non-"Starting" lines from the same minute will all be next to each other), we can accommodate that and still keep it reasonably efficient:
(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*
Have a play to make sure it works for your needs
This has the same caveat that it matches b̲l̲o̲c̲k̲s̲, thus non-starting lines betwixt matching "starting" lines will be matched still.
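If the stray lines are a problem, one option that avoids the heavier regex is to take each block and filter it afterwards. A rough PowerShell sketch follows; the variable names and the '\] - Starting' filter are my own assumptions about the line layout, and the sample is the one from the example at the end of this answer.

```powershell
# The non-sequential block pattern from above.
$pattern = '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*'

$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - not starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [3] - Starting',
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting',
    '2019-10-31 06:53:58.416 +10:00 [1] - Identical but not "starting"'
) -join "`n"

$starting = foreach ($block in [regex]::Matches($log, $pattern)) {
    # Split the block back into lines and keep only the "Starting" ones.
    # -cmatch is case-sensitive, so " - not starting" is dropped.
    $block.Value.Trim() -split "`n" | Where-Object { $_ -cmatch '\] - Starting' }
}

$starting
```

Here the block spans the four 05:49 lines, including the "not starting" one, and the filter should leave the three "Starting" lines from that minute.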
Sacrificing efficiency to get an individual match per line, we can use lookaheads/lookbehinds for the two halves.
We'll need to duplicate the regex to capture the different ends of a block.
Some browsers won't even let you do crazy stuff like this, and even though Chrome can, none of the online testers would let me give you a breakdown of the resulting regex:
(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?=(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)|(?<=(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\4\d{2}\.\d{3}\5\[\d+\][^\n]*)*)\n\4\d{2}\.\d{3}\5\[\d+\]\6[^\n]*
Luckily, PowerShell (which you mentioned in your comment that you're using) still handles it just fine, but I'm certain it'll slow to a crawl on large log files:
# Join the two halves (lookahead for block starts, lookbehind for the rest) into one alternation.
$pattern = @(
    '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?=(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)',
    '(?<=(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\4\d{2}\.\d{3}\5\[\d+\][^\n]*)*)\n\4\d{2}\.\d{3}\5\[\d+\]\6[^\n]*'
) -join '|'

# Sample log, one entry per line, joined with LF newlines.
$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - not starting',
    '2019-10-31 05:49:53.416 +10:00 [2] - Starting',
    '2019-10-31 05:49:53.416 +10:00 [3] - Starting',
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting',
    '2019-10-31 06:53:58.416 +10:00 [1] - Identical but not "starting"'
) -join "`n"

# Print the text of every individual match.
([regex]$pattern).Matches($log).Value
