I know how to remove duplicate lines and duplicate characters from text, but I'm trying to accomplish something more complicated in Python 3. I have text files that may or may not contain groups of lines that are duplicated within each file. I want to write a Python utility that will find these duplicate blocks of lines and remove all but the first occurrence.
For example, suppose file1 contains this data:
Now is the time
for all good men
to come to the aid of their party.
This is some other stuff.
And this is even different stuff.
Now is the time
for all good men
to come to the aid of their party.
Now is the time
for all good men
to come to the aid of their party.
That's all, folks.
I want the following to be the result of this transformation:
Now is the time
for all good men
to come to the aid of their party.
This is some other stuff.
And this is even different stuff.
That's all, folks.
I also want this to work when the duplicate groups of lines start somewhere other than at the beginning of the file. Suppose file2 looks like this:
This is some text.
This is some other text,
as is this.
All around
the mulberry bush
the monkey chased the weasel.
Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.
All around
the mulberry bush
the monkey chased the weasel.
End
For file2, this should be the result of the transformation:
This is some text.
This is some other text,
as is this.
All around
the mulberry bush
the monkey chased the weasel.
Here is some more random stuff.
... and this is another phrase.
End
To be clear, the potentially duplicated groups of lines are not known before running this desired utility. The algorithm would have to identify these duplicated groups of lines itself.
I'm sure that with enough work and enough time, I can eventually come up with the algorithm I'm looking for. But I'm hoping that someone might have already solved this problem and posted the results somewhere. I have been searching and haven't found anything, but perhaps I have overlooked something.
ADDENDUM: I need to add more clarity. The groups of lines must be the largest such groups, and each group must contain a minimum of 2 lines.
For example, suppose file3 looks like this:
line1 line1 line1
line2 line2 line2
line3 line3 line3
other stuff
line1 line1 line1
line3 line3 line3
line2 line2 line2
In this case, the desired algorithm will not remove any lines.
And another example, in file4:
abc def ghi
jkl mno pqr
line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz
The result I'm looking for would be this:
abc def ghi
jkl mno pqr
line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz
In other words, since the 4-line group (with "line1 ... line2 ... line3 ... line4 ...") is the largest one that is duplicated, that is the only group that is removed.
I could always repeat the process until the file is unchanged, if I then want the smaller duplicate groups to also be removed.
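To make the intent concrete, here is a rough sketch of the direction I've been considering (the function name and the scan-from-largest-size strategy are just my own guesses, not a solution I'm committed to):

```python
def remove_largest_duplicate_block(lines):
    """Remove every repeat of the largest duplicated block (>= 2 lines),
    keeping only its first occurrence; smaller duplicates are untouched."""
    n = len(lines)
    # Try block sizes from the largest possible down to the 2-line minimum.
    for size in range(n // 2, 1, -1):
        seen = set()    # blocks of this size already encountered
        remove = set()  # indices of lines belonging to repeated copies
        i = 0
        while i + size <= n:
            block = tuple(lines[i:i + size])
            if block in seen:
                remove.update(range(i, i + size))
                i += size  # skip past the removed copy
            else:
                seen.add(block)
                i += 1
        if remove:  # a duplicated block of this size exists
            return [line for j, line in enumerate(lines) if j not in remove]
    return lines  # nothing duplicated: leave the file unchanged
```

Repeating the call until the result stops changing would then take care of the smaller duplicate groups, as mentioned above.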