
Suppose I have a list of URLs; some of them are repeated any number of times, but some of them are unique. I need to get rid of the unique lines (which are useless) and keep the URLs which have been repeated more than 4 times (which are very important URLs for me to keep track of).

How can I make an expression of some sort which would delete all but the duplicate lines? Ideally I would like to whittle it down to a list of only the URLs which are repeated more than 4 times.

Liam Meevs
  • That sounds like more of a job for the `sort` and `uniq` command-line utilities, not Notepad++. That said, Notepad++ does have the TextFX plugin, which supports sorting. That may be a place to start. – Mr. Llama Mar 13 '15 at 21:50
  • Possible duplicate of https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad?rq=1 – mittmemo Mar 13 '15 at 21:51
  • Unfortunately that plugin can only remove duplicate lines; it can't do anything useful with them. – Liam Meevs Mar 13 '15 at 21:52
  • @Mr.Llama can you explain to me what steps I need to take to get and use the `uniq` command-line utility to make these changes to my text files? It didn't seem very clear to me at all when I looked into it. – Liam Meevs Mar 13 '15 at 22:09
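For the `sort`/`uniq` route suggested in the comments, one possible sketch (the file name `urls.txt` and the sample URLs are made up for illustration) is to count occurrences with `uniq -c` and filter on the count with `awk`:

```shell
# Hypothetical sample file: a.example x5, b.example x1, c.example x2
printf '%s\n' \
  http://a.example http://a.example http://a.example http://a.example http://a.example \
  http://b.example \
  http://c.example http://c.example > urls.txt

# sort groups identical lines together, uniq -c prefixes each distinct
# line with its count, and awk keeps only lines seen more than 4 times
sort urls.txt | uniq -c | awk '$1 > 4 { print $2 }'
```

This prints each qualifying URL once. Note the `awk '{ print $2 }'` step assumes the URLs contain no whitespace, which is normally true of URLs.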

1 Answer


If you slightly tweak this answer, replacing the positive lookahead with a negative lookahead, you get a regex that matches only lines which do not have a duplicate line following them:

^(.*?)$\s+?^(?!.*^\1$)

Note: you need to sort the lines lexicographically first, so that identical lines are adjacent. See the linked answer. Each run of this find-and-replace removes one copy from every group of identical lines (and removes unique lines entirely), so if you run it 3 times, the remaining lines will be those that were repeated 4 or more times in the original.

Finally, you can just use Edit -> Line Operations -> Remove Consecutive Duplicate Lines to finish the job and give you just one line for each line that was duplicated 4 or more times in the original.
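If you'd rather do this outside Notepad++, the same three steps can be sketched in a shell pipeline (the file name `urls.txt`, the sample URLs, and the helper name `pass` are all my own, not from the answer). Each `pass` mimics one run of the regex: it keeps a line only if the next line is identical, which drops the last copy of every run and removes singletons; `uniq` then plays the role of Remove Consecutive Duplicate Lines:

```shell
# Hypothetical sample: a.example x5, b.example x1, c.example x4, d.example x2
printf '%s\n' \
  http://a.example http://a.example http://a.example http://a.example http://a.example \
  http://b.example \
  http://c.example http://c.example http://c.example http://c.example \
  http://d.example http://d.example > urls.txt

# One "pass" of the regex: print a line only when the following line
# matches it, i.e. delete the last copy of each run of identical lines
pass() { awk 'NR > 1 && $0 == prev { print prev } { prev = $0 }'; }

# Sort, run three passes (anything repeated fewer than 4 times vanishes),
# then collapse the survivors to one line each
sort urls.txt | pass | pass | pass | uniq
```

Here `a.example` (5 copies) and `c.example` (4 copies) survive, while the 1- and 2-copy URLs are removed, matching the answer's "repeated 4 or more times" behaviour.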

Colm Bhandal