
I am looking for something like this, but instead of counting the number of duplicated lines I need to count the number of duplicated bunches of lines.

For the sake of clarification, I have a file like this:

Separator
line11
line12
line13
Separator
line21
line22
line23
Separator
line11
line12
line13
Separator
line11
line12
line13
Separator
line31
line32
line33
Separator
line21
line22
line23

And I would expect an output as follows:

3:    Separator
      line11
      line12
      line13
2:    Separator
      line21
      line22
      line23
1:    Separator
      line31
      line32
      line33

Here 3:, 2: and 1: indicate the number of times each bunch of lines appears in the file.

I tried the following command, without success:

sort all_lits.txt | uniq -c

and I am currently writing an awk command to obtain this information, but I have nothing working yet. As soon as I have a command worth showing, I will post it.

Is it possible to get this information using some combination of UNIX tools such as awk, grep, wc, sort, etc.?

I know I could write a script to do it, but I would like to avoid that. If there is no other way, I will.

Any help would be highly appreciated.

pafede2
  • Try providing some more explanation about what 3:, 2:, 1: mean, together with your attempts. Do not expect people to open other questions and answers if you don't show a minimal effort at solving your problem. – fedorqui Oct 01 '14 at 10:03
  • Finally I solved it with a Python script, using a dictionary and incrementing a counter each time I get a match. Thanks @fedorqui for motivating me to show what I did. Best.- – pafede2 Oct 01 '14 at 10:21
  • Nice! You may share the script in an answer, so that next people having similar problems can use it. – fedorqui Oct 01 '14 at 10:25

2 Answers

awk -v RS=Separator '
    NR>1 {count[$0]++}            # NR>1 skips the empty record before the first Separator
    END {for (bunch in count) print count[bunch], RS, bunch}
' file
1 Separator 
line31
line32
line33

2 Separator 
line21
line22
line23

3 Separator 
line11
line12
line13

There is no inherent order to the output. If you want it sorted by count in descending order and you're using GNU AWK:

awk -v RS=Separator '
    NR>1 {count[$0]++}
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"    # GNU AWK: iterate in descending numeric order of the counts
        for (bunch in count) print count[bunch], RS, bunch
    }
' file
glenn jackman
  • Even though I wrote my own script, I also used the @glenn jackman awk command. Both of them work well. Actually, I used it to validate the functionality of my script. Thanks! – pafede2 Oct 01 '14 at 11:33
  • nit-pick - testing $0 that way will remove any blocks that contain just a string that has numeric value zero. You need to test for `NF` or `$0 != ""` or `/^[[:space:]]*$/` or similar to skip blank records but I think you probably just wanted to test for `NR>1` in this case to skip the empty record before the first Separator. +1 though. – Ed Morton Oct 01 '14 at 13:16
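
Not part of the original answer, but building on the same approach: a variation that reproduces the layout asked for in the question (count and colon before each bunch, remaining lines indented underneath), sorted by count descending. The exact indentation width is an assumption, and this still requires GNU AWK for the multi-character RS and for PROCINFO["sorted_in"]:

awk -v RS=Separator '
    NR>1 {count[$0]++}
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"   # most frequent bunches first
        for (bunch in count) {
            indented = bunch
            sub(/\n$/, "", indented)              # drop the trailing newline of the bunch
            gsub(/\n/, "\n      ", indented)      # indent the lines of the bunch under the count
            printf "%d:    %s%s\n", count[bunch], RS, indented
        }
    }
' file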

This is the script I am using. It is still being tested, but it may serve as a base for other people:

with open(file_name, mode="r") as bigfile:
    reader = bigfile.read()

# Count how many times each 'Separator'-delimited bunch of lines appears.
# Note: the chunk before the first 'Separator' is the empty string and is counted too.
d = dict()
for res in reader.split('Separator'):
    if res in d:
        d[res] = d[res] + 1
    else:
        d[res] = 1

# Print each bunch followed by its count (Python 2 print statement)
for k in d:
    print str(k) + ':' + str(d[k])
pafede2