1

I am looking for an advanced version of this.

Basically, if I have a file with text:

abc
ghi
fed
jkl
abc
ghi
fed

I want the output to be:(for n=3)

Duplicated Lines
abc
ghi
fed
Times = 2
Community
  • 1
  • 1
s4san
  • 319
  • 2
  • 9
  • What kind of language you are using? – Mazdak Jun 12 '15 at 15:16
  • I am not looking for a language specific solution, although, for the sake of your question let's say Python. – s4san Jun 12 '15 at 15:18
  • @s4san Do the lines in the output have to be in the order in which they appear in the input? In your example, would it be bad if you got `abc\nfed\nghi\n` in the output? If not, you could simply use the UNIX utilities `sort` and `uniq`. – jub0bs Jun 12 '15 at 15:24
  • @Jubobs The output needs to be ordered as I have to extract the duplicates. – s4san Jun 12 '15 at 15:26
  • How much memory is available compared to the size of the file? If you don't have an actual file right now, estimate worst case scenarios. – Lasse V. Karlsen Jun 12 '15 at 15:26
  • @s4san `uniq` would remove the duplicates for you (if you apply `sort` beforehand). – jub0bs Jun 12 '15 at 15:27
  • @Jubobs I have to preserve the non-duplicate lines. – s4san Jun 12 '15 at 15:28
  • What if matches overlap, do you want to count them? For instance, if I can represent lines as characters as an example, with N=3, how many duplicates would this string produce? `ABABABA`? – Lasse V. Karlsen Jun 12 '15 at 15:28
  • @s4san You write: *I have to preserve the non-duplicate lines*. Yet that's not what the output of your example reflects: it only lists duplicate lines. – jub0bs Jun 12 '15 at 15:29
  • @LasseV.Karlsen A string with repeated characters would be considered a single line(ABABABA = 1line, since there is no `CR`) and the goal would be to find another line that matches the same. – s4san Jun 12 '15 at 15:31
  • I understand that, please think of each character as one line since it is hard to write lines in a comment. If you had two lines that occurred interleaved like that, how many duplicates would you have? – Lasse V. Karlsen Jun 12 '15 at 15:31
  • @Jubobs the output is just a listing of a particular instance of duplicated lines and not the actual file. I can run a simple search after receiving this output to extract all duplicated lines. – s4san Jun 12 '15 at 15:32
  • 2
    What does `n=3` mean? –  Jun 12 '15 at 15:33
  • @s4san Then what are you looking for, exactly? Your question is making less and less sense... – jub0bs Jun 12 '15 at 15:33
  • @sln `n` represents number of lines to search for, which could be duplicated in the rest of the file. – s4san Jun 12 '15 at 15:36
  • He is asking how to efficiently find sequences of lines that occur more than once. The 3 lines abc, ghi, fed, in that order, occur twice in the example, thus the final output. His question is poorly worded, I agree, but that is what I understood the question as. – Lasse V. Karlsen Jun 12 '15 at 15:36
  • In a program like Python I would gather `n` and `n` lines and stuff them into tuples and use those as keys to a dictionary to find and count duplicates. I would basically read the first `n` lines, create a tuple of that, then remove the first line and read the next line, create a new tuple of the `n` lines I now have, and repeat. – Lasse V. Karlsen Jun 12 '15 at 15:37
  • @Jubobs as the question says, I just want to find `n` lines in a given file that have duplicates(same `n` lines repeated) in the rest of the file, and print the `n` lines that have duplicates. – s4san Jun 12 '15 at 15:38
  • Ok, well in that case, or any case, it can't be done with regex because regex can't count, or doesn't make a count available to you. There is an exception - this can be done in Perl regex with _code construct_. Or, possibly using Dot-Net's _Capture Collection_ within a lookahead. –  Jun 12 '15 at 15:39
  • @LasseV.Karlsen Assuming your example has multiple lines, the output would be `No duplicate lines` because line with `AB` has repeated thrice but has not done so, more than once. – s4san Jun 12 '15 at 15:42
  • I don't understand then, `AB` is not one line, it is two lines, `A` and `B`. As such I would think that `A,B,A` would occur twice. – Lasse V. Karlsen Jun 12 '15 at 15:45
  • The other issue with this is there could be overlap (if n > 1) of duplicates. But, what are you going to do with that info. In regex, assuming you can get a count, you would consume a single line at a time, using lookaheads to find dups starting at that line, next line, next, etc.. –  Jun 12 '15 at 15:46

2 Answers2

1

One way is splitting your text based on your n then count the number of your elements that all is depending this counting you can use some data structures that use hash-table like dictionary in python that is much efficient for such tasks.

The task is that you create a dictionary that keeps the keys unique and then loop over the list of splitted text and increase the count of each item every time you see a duplicate.

At last you'll have a dictionary contain the unique items with those count as the values of dictionary.

Some langs like python provides good tools like Counter for count the elements within an iterable and islice for slicing and iterable that returns a generator and is very efficient for long iterables :

>>> from collections import Counter
>>> from itertools import islice

>>> s="""abc
... ghi
... fed
... jkl
... abc
... ghi
... fed"""
>>> sp=s.split()
>>> Counter('\n'.join(islice(sp,i,i+3)) for i in range(len(sp)))
Counter({'abc\nghi\nfed': 2, 'fed': 1, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1})

Or you can do it custom :

>>> a=['\n'.join(sp[i:i+3] for i in range(len(sp))]
>>> a
['abc\nghi\nfed', 'ghi\nfed\njkl', 'fed\njkl\nabc', 'jkl\nabc\nghi', 'abc\nghi\nfed', 'ghi\nfed', 'fed']
>>> d={}
>>> for i in a:
...    if i in d:
...       d[i]+=1
...    else :
...       d[i]=1
... 
>>> d
{'fed': 1, 'abc\nghi\nfed': 2, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1}
>>> 
Mazdak
  • 105,000
  • 18
  • 159
  • 188
1

So, something like this (in perl):

#!/usr/bin/perl
use strict;
use warnings;

my %seen; 
my @order; 

while ( my $line = <DATA> ) {
   chomp ( $line ); 
   push ( @order, $line ) unless $seen{$line}++; 

}

foreach my $element ( @order ) { 
    print "$element, $seen{$element}\n" if $seen{$element} > 1;
}

__DATA__
abc
ghi
fed
jkl
abc
ghi
fed

This can turn into a shorter snippet by:

perl -e 'while ( <> ) { push ( @order, $_ ) unless $seen{$_}++; } for (@order) {print if $seen{$_} > 1}' myfile
Sobrique
  • 52,974
  • 7
  • 60
  • 101