12

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file:

AAAAA
AAAAA
AAAAA
BB
BBBBB
BBBBB
CCC
CCC
CCC

I would only need the following four lines from it:

AAAAA
BB
BBBBB
CCC

I'm using a text editor (EmEditor or Notepad++) that supports regex, not a programming language, so I must use a pure regular expression.

Any help?

EDIT: I checked the other thread that hsz mentioned, and I'd like to make it clear that this one is not the same. Although both need to remove duplicate lines, the way to achieve it is different. I need pure regex, but the best answer in the other thread relies on a specific Notepad++ plug-in (which doesn't even come with it any more), so it isn't a regex solution. The second answer there is a regex and does work in Notepad++, but not in EmEditor at all, which I also need. So I don't think my question is a repeat of that one, although the link is useful, and so I thank hsz for it.

Agos FS
  • Possible duplicate of [Removing duplicate rows in Notepad++](http://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad) – hsz Jul 14 '14 at 10:48
  • Are repeated lines grouped together? That is, can the file be AAAA BBBB AAAA BBBB, which you want to turn into AAAA BBBB? – Alexander Gelbukh Jul 14 '14 at 10:50
  • Answer to Gelbukh: The lines must be in the exact same order as they were originally. – Agos FS Jul 14 '14 at 11:48
  • Possible duplicate of [find duplicate lines and remove using regular expression with replace feature](https://stackoverflow.com/questions/1573361/find-duplicate-lines-and-remove-using-regular-expression-with-replace-feature) – ARIF MAHMUD RANA Jul 02 '17 at 09:33

4 Answers

13

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

The lines will be matched, but to extract them, you really want to replace the other ones.

Replace All Repeated Lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match at the start and end of each line
  • (^[^\r\n]*) captures a line to Group 1, i.e.:
  • The ^ anchor asserts that we are at the beginning of a line (because of multi-line mode)
  • [^\r\n]* matches any chars that are not newline chars
  • [\r\n] matches one newline char
  • In the replace pattern, the positive lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
  • ^\1, the contents of Group 1 at the start of a later line
  • In the first pattern, the negative lookahead (?!.*^\1$) asserts the opposite: that no later line is identical to Group 1
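
To see what the "Replace All Repeated Lines" pattern does outside an editor, here is a minimal sketch using Python's `re` module as a stand-in. Python's engine is not the one used by Notepad++ or EmEditor, so results in the editors may differ. The second pattern, `REPEATED_ANCHORED`, is a hypothetical variant that is not part of the answer: it adds `$` inside the lookahead so that Group 1 has to equal a whole later line. Under Python at least, the unanchored lookahead also treats `BB` as repeated because a later line merely starts with `BB`.

```python
import re

# Sample text from the question.
text = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC\n"

# The "Replace All Repeated Lines" search pattern, as given in the answer.
REPEATED = re.compile(r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)")

# Hypothetical variant (an assumption, not from the answer): `$` is added
# inside the lookahead so Group 1 must match an entire later line.
REPEATED_ANCHORED = re.compile(r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1$)")

# Replace every match with the empty string, as in the editor's Replace All.
as_given = REPEATED.sub("", text)
anchored = REPEATED_ANCHORED.sub("", text)

print(as_given.splitlines())   # ['AAAAA', 'BBBBB', 'CCC']
print(anchored.splitlines())   # ['AAAAA', 'BB', 'BBBBB', 'CCC']
```

Both patterns delete a line when a matching line appears later, so the copy that survives is the last occurrence. For grouped duplicates, as in the question, that gives the same result as keeping the first.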
zx81
  • Added an option, `Replace All Repeated Lines`, that will work better in a text editor since you want to "extract" the lines. – zx81 Jul 14 '14 at 11:08
  • Thank you very much. Your second RegEx (Replace All Repeated Lines) is what I need. The first one does the opposite (but might be useful, so let it be). It works equally on both EmEditor and Notepad++ as I need, however it does not remove the empty lines. :( I already tried adding '|^\n$' to the end, but it does nothing. If you could just help me with that, this would be the best answer. :) – Agos FS Jul 14 '14 at 11:32
  • Please see the revised answer. If this works for you, please consider accepting the answer by clicking the checkmark on the left, as this is how the rep system works on the site. Thanks! – zx81 Jul 14 '14 at 13:16
  • Perfect! Works well in both editors, exactly what I needed. I'm voting this for the best answer (hope the system accepts it. Last time it didn't because I'm new here). One simple last request: please switch the order of your answers, since the second is what the thread is all about. I fear some people might not vote you up because of that. ;-) – Agos FS Jul 14 '14 at 13:53
4

You can use the following regular expression to remove both repeated and empty lines.

Find: ^(.*)(\r?\n\1)+$
Replace: \1
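
As a rough check of the Find/Replace pair above, here is a small Python sketch. It assumes Python's `re` module as a stand-in for the editor's engine, and the helper name `collapse_adjacent` is made up for illustration. The `(?m)` flag is added because Python only lets `^` and `$` match at line boundaries in multi-line mode. The second input shows that this approach only collapses duplicates that sit next to each other.

```python
import re

# Find pattern from the answer; (?m) is added for Python so that ^ and $
# match at the start and end of every line.
ADJACENT = re.compile(r"(?m)^(.*)(\r?\n\1)+$")


def collapse_adjacent(text: str) -> str:
    """Replace each run of identical consecutive lines with one copy."""
    return ADJACENT.sub(r"\1", text)


# Grouped duplicates, as in the question.
grouped = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"
print(collapse_adjacent(grouped).splitlines())
# ['AAAAA', 'BB', 'BBBBB', 'CCC']

# Duplicates that are not adjacent are left untouched.
scattered = "AAAA\nBBBB\nAAAA\nBBBB"
print(collapse_adjacent(scattered) == scattered)
# True
```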
hwnd
  • Thank you. Good solution but only works on Notepad++, as it is. I removed the question mark '?' to make it work on EmEditor, but still it only removes a few lines. I think this might be a bug of EmEditor (the program itself) not a fault of your code, so I consider this answer correct. However since I had to choose only one as the best, I chose the one from zx81, because his answer is detailed, it doesn't require any replacement (more practical) and also removes any empty line that might be in the original file (something I also needed), and of course, it works as is in both editors. – Agos FS Jul 14 '14 at 14:10
  • In VS Code use replace: `$1` and then "replace all". –  Aug 10 '18 at 09:09
0

Provided that the equal lines go in groups, that is, AAAA AAAA BBBB BBBB and not AAAA BBBB AAAA BBBB, in Perl notation, the following works:

s/(^.*$)(\r?\n\1$)*/$1/gm;

which means substitute $1 for /(^.*$)(\r?\n\1$)*/ globally and in multiline mode (^ and $ match at internal \n).

This expression means that any complete line followed by any number of equal lines is substituted by a single occurrence.

See help on your particular editor for how to apply such a regex.

Alexander Gelbukh
  • thanks, but this is not for a simple text editor as I requested. I've tried it without the final parts, but it still doesn't work either. – Agos FS Jul 14 '14 at 11:24
0

I don't know whether it will work in Notepad++ or EmEditor, but it works fine in PHP/JavaScript/Python with substitution.

^(.+)(\n(\1))*$

Here is a demo.

Simply copy your text in and get the final result from the link I shared.

Braj
  • Thanks for the link, the debugger is useful. However, the regex needs to replace any char, not just letters, so it didn't do what I actually needed. So I replaced the \w with . but now it clears everything in both EmEditor and Notepad++, although it "works" fine in the debugger... Maybe it's using a different regex standard... – Agos FS Jul 14 '14 at 11:22