I have a list of two-character values, each on its own line, in Notepad++. I am trying to eliminate the duplicates, but the pattern I have written only matches duplicates that sit on adjacent lines.

So if my list looks like this:

ME, <- not matched
OR,   |
ME, <- not matched
RI,
IL,
SD,
NV,
VA,
VA,
NY,
MN,
IL,
CA,
MI,
MO, <- match
MO, <- match

Right now I am using the pattern below. How can I modify it so that it also finds duplicates that are more than one line apart?

((\w{2}).*(\r\n)(\2))+

EDIT

((\w{2}).*(\r\n))(.*\r\n)+\1 This seems to work a bit better.

1252748

3 Answers

If you check the "dot matches newline" checkbox, you will get three matches:

ME, <-  matched
OR,   |
ME, <-  matched
RI,
IL, <-  matched
SD,   |
NV,   |
VA,   |
VA,   |
NY,   |
MN,   |
IL, <-  matched
CA,
MI,
MO, <- matched
MO, <- matched

But this won't help you remove the duplicates.

Kent
(\w{2}),[^\1]*(\1),

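This, I believe, is the closest you'll get with a single pattern. Note that [^\1] does not mean "anything except group 1": a backreference is not interpreted as one inside a character class.

EDIT: I lied, this will work. I'm not sure what language you are using, so I'll give you pseudocode.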

Essentially,

// [\s\S] matches any character, including line breaks
pattern = "(\w{2}),[\s\S]*(\1),";
compile(pattern);
while (match(pattern, input)) {
     // remove group 2, its trailing comma, and the line break (\r\n) before it
}

This will keep running through the code until you have no duplicates left.
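For anyone who wants to try the loop outside Notepad++, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not the answer's exact pattern: the name remove_duplicates is hypothetical, each line is assumed to hold exactly one two-character value followed by a comma, and Python's re module is not the engine Notepad++ uses, so the pattern may need adjusting there.

```python
import re

# Assumption: every line is exactly "XX," (two word characters and a comma).
# Group 1 is the value, group 2 is everything between the first occurrence
# and a later line holding the same value (matched lazily, across newlines).
DUPLICATE = re.compile(r"^(\w{2}),([\s\S]*?)\r?\n\1,(?=\r?$)", re.MULTILINE)

def remove_duplicates(text):
    """Repeatedly drop the later copy of a duplicated line until none remain."""
    while True:
        # Keep the first occurrence and the lines in between; drop the repeat.
        text, count = DUPLICATE.subn(r"\1,\2", text)
        if count == 0:
            return text

sample = "ME,\nOR,\nME,\nRI,\nIL,\nSD,\nNV,\nVA,\nVA,\nNY,\nMN,\nIL,\nCA,\nMI,\nMO,\nMO,"
print(remove_duplicates(sample))
```

Each pass removes at most one repeat per value it finds, which is why the substitution sits in a loop: a pair nested inside a longer match (VA, VA between the two IL lines) is only caught on the next pass.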

progrenhard

Maybe that's not the preferred answer, but I would write a small Python script to accomplish this task:

my_file = """ME,
OR,
ME,
RI,
IL,
SD,
NV,
VA,
VA,
NY,
MN,
IL,"""  # or read from disk: my_file = open("filename.txt").read()
my_set = set()  # a set keeps one copy of each line, but not the original order
for line in my_file.splitlines():
    my_set.add(line)
print(my_set)  # just for demonstration
# the with block closes the output file automatically
with open("C:\\Users\\burgert\\Desktop\\outfile.txt", "w") as out_file:
    for s in my_set:
        out_file.write(s + "\n")
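If the original order of the list matters, a set will not preserve it. A possible variant, sketched here with a hypothetical helper name unique_lines, relies on the fact that a dict keeps its keys in insertion order (guaranteed since Python 3.7):

```python
def unique_lines(text):
    """Return the text with repeated lines dropped, keeping first occurrences in order."""
    # dict.fromkeys keeps one key per distinct line, in the order first seen.
    return "\n".join(dict.fromkeys(text.splitlines()))

print(unique_lines("ME,\nOR,\nME,\nRI,\nIL,\nVA,\nVA,\nIL,"))
```

This keeps the first copy of each value and drops every later one, whatever the distance between them.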
OBu