matching across line breaks and multiple line breaks

Question

I have a list of two-character values, each on its own line, in Notepad++. I am trying to eliminate the duplicates, but the regex I have written only matches duplicates that are one line apart.

So if my list looks like this:

ME, <- not matched
OR,   |
ME, <- not matched
RI,
IL,
SD,
NV,
VA,
VA,
NY,
MN,
IL,
CA,
MI,
MO, <- match
MO, <- match

Right now I am using the regex below. How can I modify it so that it also finds duplicates that are more than one line apart?

((\w{2}).*(\r\n)(\2))+

EDIT

((\w{2}).*(\r\n))(.*\r\n)+\1 This seems to work a bit better.

Do you need to keep the original order of the matches? Can there be more than duplicates? — Tim Pietzcker, Sep 16 '13 at 20:24
@TimPietzcker Order does not matter. What do you mean more than duplicates? Thanks much! — 1252748, Sep 16 '13 at 20:27
I meant triplicates etc. - well, if order doesn't matter, can't you just sort the lines and remove the duplicates then? — Tim Pietzcker, Sep 16 '13 at 20:27
@TimPietzcker Definitely. I've seen as many as seven copies of the same value. — 1252748, Sep 16 '13 at 20:29

score 0 · Answer 1 · answered Sep 16 '13 at 20:27

If you check the "dot matches newline" checkbox, you will get three matches:

ME, <-  matched
OR,   |
ME, <-  matched
RI,
IL, <-  matched
SD,   |
NV,   |
VA,   |
VA,   |
NY,   |
MN,   |
IL, <-  matched
CA,
MI,
MO, <- matched
MO, <- matched

But this won't help you remove the duplicates.
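For anyone who wants to see the effect of that checkbox outside Notepad++, here is a minimal sketch in Python, where `re.DOTALL` plays a similar role. The pattern is a simplified stand-in for the one in the question (an assumption, not the asker's exact regex), and the sample uses `\n` line endings rather than `\r\n`.

```python
import re

# Three lines, with "ME," repeated two lines apart.
lines = "ME,\nOR,\nME,\n"

# Simplified pattern: a two-character value, anything, then the same value again.
pattern = r"(\w{2}),.*\1,"

# By default "." stops at a newline, so the second "ME," is out of reach.
without_dotall = re.search(pattern, lines)

# With DOTALL, "." also matches newlines, so the match spans several lines.
with_dotall = re.search(pattern, lines, re.DOTALL)

print(without_dotall)
print(repr(with_dotall.group(0)))
```

The first search finds nothing and the second one covers everything from the first `ME,` to the second, which is the same kind of multi-line match shown in the list above.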

score 0 · Answer 2 · edited Feb 08 '17 at 14:47

(\w{2}),[^\1]*(\1),

I believe this is the closest you'll get.

EDIT: I was wrong, this will work. I'm not sure what language you are using, so I'll give you pseudocode.

Essentially,

pattern = "(\w{2}),[^]*(\1),";
compile(pattern);
while(match(pattern, input)){
     // replace group 2 in input with "" and remove the \r\n that follows it
}

This will keep running through the code until you have no duplicates left.
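A runnable take on that loop, as a sketch in Python rather than the answer's exact pattern: the regex below is my own adaptation (it replaces `[^]*` with a lazy `.*?` under `re.DOTALL` and anchors each value to a whole line), and `remove_duplicates` is a hypothetical helper name. It assumes `\n` line endings and lines of the form `XX,`.

```python
import re

# First occurrence of a value, a lazy multi-line gap, then the same value
# again on its own line (plus its trailing newline, if any).
DUPLICATE = re.compile(r"^(\w{2}),$(.*?)^\1,$\n?", re.DOTALL | re.MULTILINE)


def remove_duplicates(text):
    """Drop later copies of each value, keeping the first occurrence."""
    while True:
        # Keep the first copy and the gap; the later copy is discarded.
        new_text, count = DUPLICATE.subn(r"\1,\2", text)
        if count == 0:
            return text  # no duplicates left
        text = new_text


sample = "ME,\nOR,\nME,\nRI,\nIL,\nVA,\nVA,\nIL,\nMO,\nMO,\nMO,\n"
print(remove_duplicates(sample))
```

The loop is needed because one pass only removes non-overlapping matches: a pair nested inside another pair (the two `VA,` lines between the two `IL,` lines) and triplicates are only cleaned up on a later pass.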

answered Sep 16 '13 at 20:37

progrenhard


What is `[^]` ? I get `(\w{2}),[ <-- Unbalanced '[' ^]*(\1),` – Sep 17 '13 at 00:39
`[^\1]` means any character that is not `1` – Sep 17 '13 at 00:43
I'm sorry, `[^\1]` means any char that is not octal \001, and is not a backref to capture group 1. – Sep 17 '13 at 01:05
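The last comment can be checked quickly. This is a small sketch using Python's `re` module, where `\1` inside a character class is read as an octal escape; other regex flavors may treat it differently, so take it as an illustration for one engine only.

```python
import re

# Inside a character class, \1 is the character with octal code 001,
# not a reference to capture group 1.
in_class = re.compile(r"[^\1]")

matches_digit_one = in_class.fullmatch("1") is not None       # "1" is allowed
matches_control_char = in_class.fullmatch("\x01") is not None  # \x01 is excluded

# Outside a character class, \1 is a real backreference.
backref = re.compile(r"(\w{2}),\1")
same_value = backref.fullmatch("ME,ME") is not None
different_value = backref.fullmatch("ME,OR") is not None

print(matches_digit_one, matches_control_char, same_value, different_value)
```

So `[^\1]*` does not mean "anything except what group 1 captured"; it only excludes a single control character.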

score 0 · Answer 3 · answered Sep 16 '13 at 20:47

Maybe that's not the preferred answer, but I would write a small Python script to accomplish this task:

my_file = """ME,
OR,
ME,
RI,
IL,
SD,
NV,
VA,
VA,
NY,
MN,
IL,"""  # or read a real file: my_file = open("filename.txt").read()
my_set = set()  # a set keeps one copy of each value (order is not preserved)
for line in my_file.splitlines():
    my_set.add(line)
print(my_set)  # just for demonstration
# write each unique value back out, one per line
with open("C:\\Users\\burgert\\Desktop\\outfile.txt", "w") as out_file:
    for s in my_set:
        out_file.write(s + "\n")
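The asker said order does not matter, but if it ever does, a set is the wrong tool because it does not keep the lines in their original order. A minimal variant, assuming Python 3.7+ where `dict` preserves insertion order (`unique_lines` is a hypothetical helper name):

```python
def unique_lines(text):
    """Return the non-empty lines of text, first occurrences only, in order."""
    # dict.fromkeys keeps one key per distinct line, in first-seen order.
    return list(dict.fromkeys(line for line in text.splitlines() if line))


sample = "ME,\nOR,\nME,\nRI,\nVA,\nVA,\n"
print("\n".join(unique_lines(sample)))
```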
