
I know how to remove duplicate lines and duplicate characters from text, but I'm trying to accomplish something more complicated in Python 3. I have text files that may or may not contain groups of lines that are duplicated within each file. I want to write a Python utility that finds these duplicate blocks of lines and removes all but the first occurrence.

For example, suppose file1 contains this data:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.

I want the following to be the result of this transformation:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.




That's all, folks.

I also want this to work when the duplicate groups of lines are found starting somewhere other than at the beginning of the file. Suppose file2 looks like this:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.

All around
the mulberry bush
the monkey chased the weasel.

End

For file2, this should be the result of the transformation:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
... and this is another phrase.


End

To be clear, the potentially duplicated groups of lines are not known before running this desired utility. The algorithm has to identify these duplicated groups of lines itself.

I'm sure that with enough work and enough time, I can eventually come up with the algorithm I'm looking for. But I'm hoping that someone might have already solved this problem and posted the results somewhere. I have been searching and haven't found anything, but perhaps I have overlooked something.

ADDENDUM: I need to add more clarity. The duplicated groups must be the largest such groups, and each group must contain a minimum of 2 lines.

For example, suppose file3 looks like this:

line1 line1 line1
line2 line2 line2
line3 line3 line3

other stuff

line1 line1 line1
line3 line3 line3
line2 line2 line2

In this case, the desired algorithm will not remove any lines.

And another example, in file4:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

The result I'm looking for would be this:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

In other words, since the 4-line group (with "line1 ... line2 ... line3 ... line4 ...") is the largest one that is duplicated, that is the only group that is removed.

I could always repeat the process until the file is unchanged, if I then want the smaller duplicate groups to also be removed.
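
For illustration, such a driver might look like the following minimal sketch; `remove_longest_dup` is a placeholder name for whatever single-pass removal function the utility ends up using, not an existing routine:

def dedup_until_stable(text, remove_longest_dup):
    # Apply one removal pass at a time until a pass changes nothing.
    while True:
        new_text = remove_longest_dup(text)
        if new_text == text:
            return text
        text = new_text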

HippoMan
  • From the second example, it looks like you want to remove duplicate lines only. What makes a set of lines a group? Does it have to sit between empty lines to be considered a group, and are other groups that contain the preceding one to be clipped, as in example 2? I think you have to state your problem much more precisely to get the answer you need – Derte Trdelnik Dec 23 '17 at 22:34
  • Yes, my original question was unclear. See the `ADDENDUM` that I appended to my original post. – HippoMan Dec 23 '17 at 22:39
  • ... and I now added another example to clarify even more. – HippoMan Dec 23 '17 at 22:52
  • The issue here is identifying "groups". In your first example there is a blank line separating them, which makes it easy; in the last one there is no such thing, which makes things harder... What are the rules for something to be a "group"? With the current info, what I can think of is to try all sizes from 2 up to half the largest run of non-empty consecutive lines... – Copperfield Dec 24 '17 at 00:06
  • A group is anything consisting of 2 or more lines. Empty lines are treated the same as lines that contain data. For example, "line1 ... ... line2 ... line3" could be a group if that pattern appears more than once in the file. Yes, iterating over and over (or recursing) for all possible group sizes from 2 to half the number of lines (empty or otherwise) could work; a sketch of that brute-force idea appears after these comments. I'm wondering if there's something more efficient. – HippoMan Dec 24 '17 at 02:24
  • from your "file 4" example you are trying to find the longest(in number of lines) substring that appears at least twice - this can be done using suffix trees with some modification to count endline symbols if this is what you want, refer to https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string – Derte Trdelnik Dec 24 '17 at 09:29
  • Thank you. I'm now exploring this suffix-tree solution, and I will post my results once I code them up. – HippoMan Dec 24 '17 at 15:49
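
For illustration, here is a minimal sketch of the brute-force search discussed in the comments above; the function name find_longest_duplicated_block and its return convention are placeholders, not anything from the thread:

def find_longest_duplicated_block(lines):
    # Brute force: for each block size from the largest possible
    # (half the file) down to the 2-line minimum, remember where each
    # block first starts and report the first block seen twice.
    # Worst case is roughly O(n^3) time, but it is easy to verify
    # against small files.
    n = len(lines)
    for size in range(n // 2, 1, -1):
        seen = {}
        for start in range(n - size + 1):
            block = tuple(lines[start:start + size])
            if block in seen:
                # (first occurrence, duplicate occurrence, block size)
                return seen[block], start, size
            seen[block] = start
    return None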

2 Answers


I came up with the following solution. It might still have some unaccounted-for edge cases, and it might not be the most efficient way to do this, but at least in my preliminary testing, it seems to work.

This revision fixes some bugs in my originally submitted version.

Any suggestions for improvement are welcome.

# Remove all but the first occurrence of the longest
# duplicated group of lines from a block of text.
# In this utility, a "group" of lines is considered
# to be two or more consecutive lines.
#
# Much of this code has been shamelessly stolen from
# https://programmingpraxis.com/2010/12/14/longest-duplicated-substring/

import sys

from itertools import starmap, takewhile, tee
from operator import eq, truth

# imap and izip no longer exist in python3 itertools.
# These are simply equivalent to map and zip in python3.
try:
    # python2 ...
    from itertools import imap
except ImportError:
    # python3 ...
    imap = map
try:
    # python2 ...
    from itertools import izip
except ImportError:
    # python3 ...
    izip = zip

def remove_longest_dup_line_group(text):
    if not text:
        return ''
    # Unlike in the original code, here we're dealing
    # with groups of whole lines instead of strings
    # (groups of characters). So we split the incoming
    # data into a list of lines, and we then apply the
    # algorithm to these lines, treating a line in the
    # same way that the original algorithm treats an
    # individual character.
    lines = text.split('\n')
    ld = longest_duplicate(lines)
    if not ld:
        return text
    tokens = text.split(ld)
    if len(tokens) < 1:
        # Defensive programming: this shouldn't ever happen,
        # but just in case ...
        return text
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))

def pairwise(iterable):
    # Yield adjacent pairs: (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def prefix(a, b):
    # Count how many leading lines the two suffixes share.
    count = sum(takewhile(truth, imap(eq, a, b)))
    if count < 2:
        # Blocks must consist of more than one line.
        return ''
    else:
        return '{}\n'.format('\n'.join(a[:count]))

def longest_duplicate(s):
    # Guard against inputs of fewer than two lines, where no
    # duplicated group is possible (and max() would otherwise
    # fail on an empty sequence).
    if len(s) < 2:
        return ''
    suffixes = (s[n:] for n in range(len(s)))
    return max(starmap(prefix, pairwise(sorted(suffixes))), key=len)

if __name__ == '__main__':
    text = sys.stdin.read()
    if text:
        # Use sys.stdout.write instead of print to
        # avoid adding an extra newline at the end.
        sys.stdout.write(remove_longest_dup_line_group(text))
    sys.exit(0)
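
Saved as, say, dedup.py (the file name here is just an example), the script filters standard input to standard output: `python3 dedup.py < file1 > file1.dedup`. As a quick check of the core function, a two-line group that appears twice loses its second copy:

>>> remove_longest_dup_line_group('a\nb\nx\na\nb\n')
'a\nb\nx\n'
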
HippoMan

Quick and dirty, not tested for edge cases:

#!/usr/bin/env python3

from pathlib import Path

TEXT = '''Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.'''

def remove_duplicate_blocks(lines):
    num_lines = len(lines)

    for idx_start in range(num_lines):
        idx_end = num_lines

        # Consider candidate blocks starting at idx_start + 1,
        # longest first.
        for idx in range(idx_end, -1, -1):
            if idx_start < idx:
                dup_candidate_block = lines[idx_start + 1: idx]
                len_dup_block = len(dup_candidate_block)
                if len_dup_block and len_dup_block < num_lines // 2:
                    # Look for an identical block that starts earlier
                    # in the file.
                    for scan_idx in range(idx):
                        if ((idx_start + 1) > scan_idx
                                and dup_candidate_block == lines[scan_idx: scan_idx + len_dup_block]):
                            # Drop the later copy and restart on the
                            # shortened list of lines.
                            lines[idx_start + 1: idx] = []
                            return remove_duplicate_blocks(lines)
    return lines


if __name__ == '__main__':
    clean_lines = remove_duplicate_blocks(TEXT.split('\n'))
    print('\n'.join(clean_lines))

OUTPUT:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.
And this is even different stuff.
That's all, folks.
ccpizza