Regex/Algorithm to find 'n' repeated lines in a file

Question

I am looking for an advanced version of this.

Basically, if I have a file with text:

abc
ghi
fed
jkl
abc
ghi
fed

I want the output to be:(for n=3)

Duplicated Lines
abc
ghi
fed
Times = 2

I am not looking for a language specific solution, although, for the sake of your question let's say Python. — s4san, Jun 12 '15 at 15:18
@s4san Do the lines in the output have to be in the order in which they appear in the input? In your example, would it be bad if you got `abc\nfed\nghi\n` in the output? If not, you could simply use the UNIX utilities `sort` and `uniq`. — jub0bs, Jun 12 '15 at 15:24
@Jubobs The output needs to be ordered as I have to extract the duplicates. — s4san, Jun 12 '15 at 15:26
How much memory is available compared to the size of the file? If you don't have an actual file right now, estimate worst case scenarios. — Lasse V. Karlsen, Jun 12 '15 at 15:26
@s4san `uniq` would remove the duplicates for you (if you apply `sort` beforehand). — jub0bs, Jun 12 '15 at 15:27
What if matches overlap, do you want to count them? For instance, if I can represent lines as characters as an example, with N=3, how many duplicates would this string produce? `ABABABA`? — Lasse V. Karlsen, Jun 12 '15 at 15:28
@s4san You write: *I have to preserve the non-duplicate lines*. Yet that's not what the output of your example reflects: it only lists duplicate lines. — jub0bs, Jun 12 '15 at 15:29
@LasseV.Karlsen A string with repeated characters would be considered a single line(ABABABA = 1line, since there is no `CR`) and the goal would be to find another line that matches the same. — s4san, Jun 12 '15 at 15:31
I understand that, please think of each character as one line since it is hard to write lines in a comment. If you had two lines that occurred interleaved like that, how many duplicates would you have? — Lasse V. Karlsen, Jun 12 '15 at 15:31
@Jubobs the output is just a listing of a particular instance of duplicated lines and not the actual file. I can run a simple search after receiving this output to extract all duplicated lines. — s4san, Jun 12 '15 at 15:32
@s4san Then what are you looking for, exactly? Your question is making less and less sense... — jub0bs, Jun 12 '15 at 15:33
@sln `n` represents number of lines to search for, which could be duplicated in the rest of the file. — s4san, Jun 12 '15 at 15:36
He is asking how to efficiently find sequences of lines that occur more than once. The 3 lines abc, ghi, fed, in that order, occur twice in the example, thus the final output. His question is poorly worded, I agree, but that is what I understood the question as. — Lasse V. Karlsen, Jun 12 '15 at 15:36
In a program like Python I would gather `n` and `n` lines and stuff them into tuples and use those as keys to a dictionary to find and count duplicates. I would basically read the first `n` lines, create a tuple of that, then remove the first line and read the next line, create a new tuple of the `n` lines I now have, and repeat. — Lasse V. Karlsen, Jun 12 '15 at 15:37
@Jubobs as the question says, I just want to find `n` lines in a given file that have duplicates(same `n` lines repeated) in the rest of the file, and print the `n` lines that have duplicates. — s4san, Jun 12 '15 at 15:38
Ok, well in that case, or any case, it can't be done with regex because regex can't count, or doesn't make a count available to you. There is an exception - this can be done in Perl regex with _code construct_. Or, possibly using Dot-Net's _Capture Collection_ within a lookahead. — , Jun 12 '15 at 15:39
@LasseV.Karlsen Assuming your example has multiple lines, the output would be `No duplicate lines` because line with `AB` has repeated thrice but has not done so, more than once. — s4san, Jun 12 '15 at 15:42
I don't understand then, `AB` is not one line, it is two lines, `A` and `B`. As such I would think that `A,B,A` would occur twice. — Lasse V. Karlsen, Jun 12 '15 at 15:45
The other issue with this is there could be overlap (if n > 1) of duplicates. But, what are you going to do with that info. In regex, assuming you can get a count, you would consume a single line at a time, using lookaheads to find dups starting at that line, next line, next, etc.. — , Jun 12 '15 at 15:46

Mazdak · Accepted Answer · 2015-06-12T15:44:01.577

One way is splitting your text based on your n then count the number of your elements that all is depending this counting you can use some data structures that use hash-table like dictionary in python that is much efficient for such tasks.

The task is that you create a dictionary that keeps the keys unique and then loop over the list of splitted text and increase the count of each item every time you see a duplicate.

At last you'll have a dictionary contain the unique items with those count as the values of dictionary.

Some langs like python provides good tools like Counter for count the elements within an iterable and islice for slicing and iterable that returns a generator and is very efficient for long iterables :

>>> from collections import Counter
>>> from itertools import islice

>>> s="""abc
... ghi
... fed
... jkl
... abc
... ghi
... fed"""
>>> sp=s.split()
>>> Counter('\n'.join(islice(sp,i,i+3)) for i in range(len(sp)))
Counter({'abc\nghi\nfed': 2, 'fed': 1, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1})

Or you can do it custom :

>>> a=['\n'.join(sp[i:i+3] for i in range(len(sp))]
>>> a
['abc\nghi\nfed', 'ghi\nfed\njkl', 'fed\njkl\nabc', 'jkl\nabc\nghi', 'abc\nghi\nfed', 'ghi\nfed', 'fed']
>>> d={}
>>> for i in a:
...    if i in d:
...       d[i]+=1
...    else :
...       d[i]=1
... 
>>> d
{'fed': 1, 'abc\nghi\nfed': 2, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1}
>>>

So, the keys here would be lines and I start the search with n = 1 and find all duplicates of n=1, then do a recursion by incrementing n? — s4san, Jun 12 '15 at 15:35
Can you edit your answer, it says 'test' so I got confused :) — s4san, Jun 12 '15 at 15:43

Sobrique · Answer 2 · 2015-06-12T15:53:39.413

1

So, something like this (in perl):

#!/usr/bin/perl
use strict;
use warnings;

my %seen; 
my @order; 

while ( my $line = <DATA> ) {
   chomp ( $line ); 
   push ( @order, $line ) unless $seen{$line}++; 

}

foreach my $element ( @order ) { 
    print "$element, $seen{$element}\n" if $seen{$element} > 1;
}

__DATA__
abc
ghi
fed
jkl
abc
ghi
fed

This can turn into a shorter snippet by:

perl -e 'while ( <> ) { push ( @order, $_ ) unless $seen{$_}++; } for (@order) {print if $seen{$_} > 1}' myfile

edited Jun 12 '15 at 15:53

answered Jun 12 '15 at 15:45

Sobrique

52,974
7
60
101

1

Its too easy using hashes. I think the OP wants a painful solution. – Jun 12 '15 at 15:49
Well, I can always onelinerify it to make it proper "write only" perl. – Sobrique Jun 12 '15 at 15:55

Regex/Algorithm to find 'n' repeated lines in a file

2 Answers2