How do you match duplicate values on multiple lines?

Question

I'm trying to match multiple lines with duplicate values. My script will continue if there is more than one match. I've been going through the backreferences documentation, but I can't seem to get it right for my case.

The idea is to query a log file which contains time stamps and actions. I would like to match any lines within the log file which contain duplicate timestamps with a "Starting" string contained on the line.

Using this pattern:

^(\b\d+)-(\d{2})-(\d{2}) (\d+):(\d{2})(?=\b[\s\S]*Starting\b)(?=[\s\S]*\b\1\b)

I'm hoping to match the first two lines, simply because the time stamps are the exact same.

2019-10-31 05:49:52.416 +10:00 [1] - Starting
2019-10-31 05:49:53.416 +10:00 [1] - Starting
2019-10-31 06:53:58.416 +10:00 [1] - Starting

At the moment, it only captures the first line (1 match). How do I get it to match duplicate values on multiple lines?

EDIT:

For clarification, my pattern is looking for duplicate values of YYYY-MM-DD HH:MM.

My pattern is looking for YYYY-MM-DD HH:MM. This needs to be regex as I am using powershell to remote onto the machine and query the log files. — juiceb0xk, Dec 17 '19 at 03:01
OK--I'll remove the PHP tag then. Regex is pretty inefficient at this sort of thing, so hopefully you have a small amount of data. — ggorlen, Dec 17 '19 at 03:03

Simon · Accepted Answer · 2019-12-17T06:33:36.360

The example

(?<log>(?<ymdhm>\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*?(?<flag>Starting)$)\n\k<ymdhm>.*?\k<flag>

[Update]
OK, I updated the regex; it's not as easy as I expected.

Here is the explanation:

The group "log" matches a single line according to your basic rule. It has several parts:
1. (?<ymdhm>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) The group "ymdhm" captures YYYY-MM-DD HH:MM. This is important for the later match: you only care about the timestamp up to the minute, so the next qualifying line must start with exactly the same text.
2. (?<flag>Starting)$ The group "flag" captures the important string you are looking for at the end of the line, i.e. the "flag".
3. .*? In the middle, between the two, are the characters you don't care much about.
Then there must be another line, hence the \n. The regex uses the flags gm here. Without the \n, the match would stop at the end of the first line and never check the following one.
\k<ymdhm> is a backreference to the group "ymdhm": it matches the same text that group captured, so the next log line must start with the same date, hour and minute.
Then .*? lazily matches arbitrary characters.
Finally, \k<flag> matches the same text that the group "flag" captured on the previous line.
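
To make the walkthrough above concrete, here is a minimal PowerShell sketch that runs the pattern against the three sample lines from the question. The variable names ($log, $pattern, $found) are only illustrative, and the sample is joined with LF so that the literal \n in the pattern applies.

```powershell
# Sample log from the question, joined with LF line endings.
$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting'
    '2019-10-31 05:49:53.416 +10:00 [1] - Starting'
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting'
) -join "`n"

$pattern = '(?<log>(?<ymdhm>\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*?(?<flag>Starting)$)\n\k<ymdhm>.*?\k<flag>'

# Multiline lets $ match at the end of every line (the "m" flag mentioned above).
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline
$found = [regex]::Matches($log, $pattern, $options)

$found.Count                       # 1 (one match spanning the first two lines)
$found[0].Groups['ymdhm'].Value    # 2019-10-31 05:49
$found[0].Value -split "`n"        # the two lines that share the minute
```

Note that each match consumes two lines, so with three consecutive "Starting" lines in the same minute this pattern would, as far as I can tell, report the first two and leave the third unmatched.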

Not quite. This pattern is looking for any lines with "Starting" in it. A matched line must have "Starting" in it, yes, but the timestamps must also be the same for each line. There should only be two matches with the test string you have provided. — juiceb0xk, Dec 17 '19 at 03:48

Hashbrown · Answer 2 · 2019-12-17T05:50:06.097

(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)+

Give it a whirl

Note:

this doesn't tolerate deviations in spacing (you'd just need to replace each literal space in the pattern with \s+)
will only match blocks of duplicates, not each duplicate individually (one match in your example, encompassing the two lines)
duplicates will only be recognised if they are sequential (this is to keep the regex efficient)
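
Since the pattern returns one match per block, a script that needs the individual lines (or a count of them) can split each block afterwards. The following is a rough sketch of that idea, assuming LF line endings; $log, $blocks and $duplicateLines are made-up names for illustration.

```powershell
# Sample log from the question, joined with LF line endings.
$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting'
    '2019-10-31 05:49:53.416 +10:00 [1] - Starting'
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting'
) -join "`n"

$pattern = '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)+'

# Each match is a whole block of consecutive same-minute "Starting" lines.
$blocks = [regex]::Matches($log, $pattern)

# A block that did not start at the beginning of the text begins with the
# newline consumed by (?:^|\n), so trim before splitting it into lines.
$duplicateLines = foreach ($block in $blocks) {
    $block.Value.Trim() -split "`n"
}

$blocks.Count                # 1 block
@($duplicateLines).Count     # 2 lines inside it
```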

I disagree with @ggorlen's assessment, regex will be literally the fastest thing you can do for a problem that requires this amount of expressive power.

If, however, you need to match "Starting" lines that aren't sequential, but you can guarantee that lines will be in-order (which for basically all logs will be the case, "Starting" and non-"starting" lines of the same minute will all be next to each other) we can accommodate that and still keep it reasonably efficient:

(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*

Have a play to make sure it works for your needs

This has the same caveat that it matches b̲l̲o̲c̲k̲s̲, thus non-starting lines betwixt matching "starting" lines will be matched still.

Sacrificing efficiency to get an individual match per line we can use lookahead/behinds for the two halves.
We'll need to duplicate the regex to capture the different ends of a block.

Some browsers won't even let you do crazy stuff like this, and even though Chrome can, none of the online testers would let me give you a breakdown of the resulting regex.

(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?=(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)|(?<=(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\4\d{2}\.\d{3}\5\[\d+\][^\n]*)*)\n\4\d{2}\.\d{3}\5\[\d+\]\6[^\n]*

Luckily, PowerShell (as you mentioned you're using in your comment) still handles it just fine, but I'm certain it'll crawl to a halt for large logfiles.

(
    ([regex](
        (
            '(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?=(?:\n\1\d{2}\.\d{3}\2\[\d+\][^\n]*)*\n\1\d{2}\.\d{3}\2\[\d+\]\3[^\n]*)',
            '(?<=(?:^|\n)(\d{4}(?:-\d{2}){2} (?:\d{2}:){2})\d{2}\.\d{3}( [+-]\d{2}:\d{2} )\[\d+\]( - Starting)[^\n]*(?:\n\4\d{2}\.\d{3}\5\[\d+\][^\n]*)*)\n\4\d{2}\.\d{3}\5\[\d+\]\6[^\n]*'
        ) -join '|')
    ).Matches((
        '2019-10-31 05:49:52.416 +10:00 [1] - Starting',
        '2019-10-31 05:49:53.416 +10:00 [2] - not starting',
        '2019-10-31 05:49:53.416 +10:00 [2] - Starting',
        '2019-10-31 05:49:53.416 +10:00 [3] - Starting',
        '2019-10-31 06:53:58.416 +10:00 [1] - Starting',
        '2019-10-31 06:53:58.416 +10:00 [1] - Identical but not "starting"'
    ) -join "`n")
).Value

score 1 · Answer 3 · answered Dec 17 '19 at 10:30

You could use a backreference by capturing the part of the timestamp that you would consider to be the same on all the following lines, and you might also capture the part of Starting in a second capturing group.

Then you could repeat matching all the lines that start with the same value as group 1 and contain group 2 in the line.

^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*(\bStarting\b)(?:\R\1.+\2)+

^ Start of the line
( Capture group 1
- \d{4}-\d{2}-\d{2} \d{2}:\d{2} Match the timestamp like format you want to capture
) Close group
.* Match any char except a newline 0+ times
( Capture group 2
- \bStarting\b Match Starting between word boundaries
) Close group
(?: Non capturing group
- \R\1.+\2 Match Unicode newline sequence, a backreference to what is captured in group 1, 1+ times any char except a newline and a backreference to what is captured in group 2
)+ Close non capturing group and repeat 1+ times to match at least 2 lines
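
One caveat if this is run from PowerShell: to my knowledge the .NET regex engine does not recognise \R, so the pattern as written targets PCRE-style engines. A possible adaptation, offered as a sketch and not as this answer's original pattern, swaps \R for \r?\n and enables the Multiline option so that ^ anchors at every line. The variable names are illustrative only.

```powershell
# Sample log from the question, joined with LF line endings.
$log = @(
    '2019-10-31 05:49:52.416 +10:00 [1] - Starting'
    '2019-10-31 05:49:53.416 +10:00 [1] - Starting'
    '2019-10-31 06:53:58.416 +10:00 [1] - Starting'
) -join "`n"

# Assumption: \R replaced by \r?\n so the pattern compiles under .NET.
$pattern = '^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*(\bStarting\b)(?:\r?\n\1.+\2)+'

# Multiline makes ^ match at the start of each line, not only at the start of the text.
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline
$found = [regex]::Matches($log, $pattern, $options)

$found.Count               # 1 (the block made of the first two lines)
$found[0].Groups[1].Value  # 2019-10-31 05:49
```

The \r?\n form is meant to tolerate Windows line endings as well, for example when the log is read in one piece with Get-Content -Raw.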

Regex demo

@juiceb0xk please read [what to do when someone answers my question?](https://stackoverflow.com/help/someone-answers). Consider accepting an answer if it solves the problem. — ggorlen, Dec 22 '19 at 04:54
