Match Partially Duplicated Lines

Question

I have rows in a list that are sometimes identical up to the first space character and then differ (e.g. a date follows).

wsmith jul/12/12
bwillis jul/13/13
wsmith jul/14/12
tcruise jul/12/12

I can easily sort the lines, but I'd love to remove the duplicate, later-dated entry. I did find a regex suggestion, but it only matches lines that are exactly the same. I need to be able to mark the entire row for every username that repeats in the file. In my example above, lines 1 and 3 would be highlighted.

(edited for clarity)

I'm not sure what you're trying to do. 'Partially' is pretty subjective and in a sense, all the above 3 lines match partially because all begin with `user`[.](http://regex101.com/r/eC5uK2/10) And how many 'partial duplicates' are you talking about? 2? 10? 100? – Jerry Jul 25 '14 at 13:02
Up to the first space, i.e. the entire username. Usually there would only be one or two other duplicates of the same user, typically just one. – techguy2014 Jul 25 '14 at 14:54
Ok, in other words, you want a regex that highlights all the usernames that appear at least twice in the file and all that follows on the same line? If yes, you want it to also mark the first duplicate? – Jerry Jul 25 '14 at 15:06
Exactly. To highlight the first or both of the entries even. – techguy2014 Jul 25 '14 at 16:06

zx81 · Answer 1 · 2014-07-26T05:23:43.740

4

A compact pattern in Perl-compatible syntax (as accepted by Notepad++) to check whether the start of one row is repeated on a later row would be

(?m)^(\S+).*\R(?s).*?\K\1

This will work in N++.

As you remove duplicate lines, more may become marked, because initially the regex skips over the in-between lines in order to highlight the duplicate.

Explanation

(?m) turns on multi-line mode, allowing ^ and $ to match on each line
The ^ anchor asserts that we are at the beginning of a line (because of (?m))
(\S+) captures non-space chars to Group 1
.* gets to the end of the line
\R line break
(?s) activates DOTALL mode, allowing the dot to match across lines
.*? lazily match chars up to ...
The \K tells the engine to drop what was matched so far from the final match it returns
\1 back-reference: match what Group 1 captured before.
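
For anyone who wants to try the idea outside Notepad++, here is a minimal Python sketch. It is an adaptation, not the pattern above: Python's `re` module has no `\K` or `\R`, so the sketch assumes `\n` line endings and reads the repeated text from a second capture group instead. The names `SAMPLE`, `DUPLICATE` and `find_duplicates` are made up for illustration.

```python
import re

# Sample rows from the question (assumed to use "\n" line endings).
SAMPLE = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12"
)

# ^(\S+)     first run of non-space characters on a line -> group 1
# .*\n       rest of that line and its line break (stands in for \R)
# (?s:.*?)   lazily skip ahead, across lines
# (\1)       the repeated text -> group 2 (stands in for \K\1)
DUPLICATE = re.compile(r"^(\S+).*\n(?s:.*?)(\1)", re.MULTILINE)


def find_duplicates(text):
    """Return (line_number, repeated_text) for each repeat the pattern finds."""
    found = []
    for match in DUPLICATE.finditer(text):
        # Count the line breaks before the repeat to get a 1-based line number.
        line_number = text.count("\n", 0, match.start(2)) + 1
        found.append((line_number, match.group(2)))
    return found


print(find_duplicates(SAMPLE))
```

With the sample rows this reports the repeat of wsmith on line 3. Like the original, the search resumes after each repeat it finds, so rows in between are skipped on that pass.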

edited Jul 26 '14 at 05:23

answered Jul 25 '14 at 03:24


Added demo and explanation, let me know if you have questions. :) – zx81 Jul 25 '14 at 03:36
I could then use it to 'Mark' the matches at least? That would be good enough really. I'll test it out thanks! – techguy2014 Jul 25 '14 at 04:12
@user3875369 If it works with Find, it will work with Mark. Just make sure your cursor is well placed, in this case, better at the start and you search down. – Jerry Jul 25 '14 at 07:25
@Unihedron Thank you... First I had the `(?s)` at the front and it was quite ugly. This gives us the best of both worlds. – zx81 Jul 25 '14 at 09:01
@Unihedron Well, technically, it's 'Suddenly, singleline!'. DOTALL is the name given in python and singleline is why it's `(?s)` and not `(?d)` ;) – Jerry Jul 25 '14 at 10:57
@Jerry @Unihedron FYI the PCRE doc is littered with `DOTALL` :) – zx81 Jul 25 '14 at 11:41
@Jerry Btw are you thinking that the term `DOTALL` started in Python? If so I was not aware of that, but not claiming to know the history. If you have any info on that I'd be interested. :) – zx81 Jul 25 '14 at 11:45
@zx81 Hmm, no, just know that it's `re.DOTALL` in python and `RegularExpressions.SingleLine` in C#. Didn't know about the name of that modifier till last year and just knew it as such. Will look into it some more when I'll be home – Jerry Jul 25 '14 at 11:57
Doesn't work. It needs to look at the beginning of each line, only up to the first space character, then find any matches of same in other rows. I tried the regex demo with live data and doesn't match at all. – techguy2014 Jul 25 '14 at 14:52
Hey there, for the record it DID work exactly according to your original question, where there was no mention of beginning of line, spaces etc. You added that later. Tweaked the regex to your new requirements, see screenshot. :) – zx81 Jul 26 '14 at 01:09

Jerry · Answer 2 · 2014-07-26T05:33:45.933

3

I propose this regex:

^(\S+) (?=(?s:.)*\1.*).*

It will mark the first users that have a duplicate.

regex101 demo

^          # Beginning of line
(\S+)      # Match and store non-spaces
           # One space
(?=        # Positive look-ahead begin
  (?s:.)*  # Match any character including newlines
  \1.*     # Match the matched group (i.e. the username) and anything following on same line
)          # End lookahead
.*         # Match anything remaining on line (mainly for the first match)
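
The pattern can also be tried in Python, whose `re` module accepts the scoped `(?s:.)` form (Python 3.6 or later). This is only a sketch on the rows from the question; `SAMPLE`, `FIRST_OF_DUPLICATES` and `rows_with_later_duplicate` are names invented for the example, and `re.MULTILINE` is assumed so that `^` matches at each line start.

```python
import re

# Sample rows from the question.
SAMPLE = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12"
)

# The regex from this answer, unchanged; MULTILINE lets ^ match on every line.
FIRST_OF_DUPLICATES = re.compile(r"^(\S+) (?=(?s:.)*\1.*).*", re.MULTILINE)


def rows_with_later_duplicate(text):
    """Return every row whose username shows up again further down."""
    return [match.group(0) for match in FIRST_OF_DUPLICATES.finditer(text)]


print(rows_with_later_duplicate(SAMPLE))
```

On the sample this returns only the first wsmith row. The last occurrence is not matched because nothing after it repeats the username.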

If Notepad++ marked capture groups, you would be able to use this to highlight all duplicates, including the last one:

^(\S+) (?=(?s:.)*(\1.*)).*

regex101 demo

But unfortunately (at least for v6.5.2), N++ doesn't mark the capture groups.
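
Outside Notepad++, a script can read the extra capture group directly. The following Python sketch assumes the second pattern above and the sample rows from the question; `SAMPLE`, `WITH_LAST` and `rows_to_mark` are hypothetical names used only here.

```python
import re

# Sample rows from the question.
SAMPLE = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12"
)

# Second pattern from this answer: group 2 holds the later occurrence
# found by the lookahead, together with the rest of its line.
WITH_LAST = re.compile(r"^(\S+) (?=(?s:.)*(\1.*)).*", re.MULTILINE)


def rows_to_mark(text):
    """Collect each matched row plus the later row captured in group 2."""
    marked = []
    for match in WITH_LAST.finditer(text):
        for row in (match.group(0), match.group(2)):
            if row not in marked:
                marked.append(row)
    return marked


print(rows_to_mark(SAMPLE))
```

For the sample this yields both wsmith rows. Note that `\1` inside the lookahead is not anchored to a line start, so a username that also appears in the middle of a later line would be picked up as well.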

edited Jul 26 '14 at 05:33

answered Jul 25 '14 at 16:48


Hey Jerry, I'm guessing you added an answer because my screenshot didn't look like his requirements. That's because he edited his question; originally there was just mention of repeating material from one line on another line, no mention of start of line etc. Gave it a small tweak and changed the screenshot. +1 anyway for the different approach. :) – zx81 Jul 26 '14 at 01:12
@zx81 Nope, I added this one because it would highlight all the users in one go, plus all the duplicates bar the last one. In my demo, I have two duplicated users, `user1` and `user5` and both are correctly highlighted, with everything that is on these lines. – Jerry Jul 26 '14 at 04:53
I see. Nice!!! Btw unless I'm missing something, it looks like you can [shorten it a bit](http://regex101.com/r/aI4kP7/3). – zx81 Jul 26 '14 at 05:22
@zx81 Oh right. The initial version would highlight all including the last dup, but with the downside of highlighting only one user... and by the time it evolved to the above, I didn't see that `\G` wasn't being used at all. If length is a factor, the above can lose one more char and give the same results. Will edit. – Jerry Jul 26 '14 at 05:31
That happens to me all the time, my final expression often has "fat" that was needed at an earlier stage, but not when I step back... But often the one to step back is someone else looking at it. :) – zx81 Jul 26 '14 at 05:34
