2

I have a text file in which a particular set of consecutive lines appears again and again. I need to trim all the duplicate occurrences and print only the first one.

Input:

$ cat log_repeat.txt
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554

$

The Perl solution below works only when the block occurs an odd number of times,

$ perl -0777 -pe 's/(^total.*)\1//gms ' log_repeat.txt
total bytes = 0, at time = 1190554
time window = 0, at time = 1190554
BW in Mbps = 0, at time = 1190554

$

and prints nothing when it occurs an even number of times. How do I get the first occurrence regardless of whether the section repeats an odd or even number of times?

stack0114106
  • You can simply load all the lines into an array, use the `uniq()` function, and then print all the elements of the array. This question can help you: https://stackoverflow.com/questions/7651/how-do-i-remove-duplicate-items-from-an-array-in-perl – Mobrine Hayde Mar 01 '19 at 12:32
  • @MobrineHayde.. no, I need to get them in order.. also the section can span many lines.. in the given sample, it spans across 3 lines.. – stack0114106 Mar 01 '19 at 12:35

2 Answers

2

Match your block multiple times, greedily, as long as all of that is followed by yet another copy of it:

perl -0777 -wpe's/(total.*)+(?=\1)//s' log_repeat.txt

The lookahead ensures that one copy (the last one) remains, since a lookahead doesn't consume what it matches.

Or keep the first match by dropping it from the matched text with \K, and remove the others:

perl -0777 -wpe's/(total.*?)\K\1+//s' log_repeat.txt

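Note that the non-greedy .*? is required here: a greedy .* would capture as many blocks as it can while still leaving a copy for \1 to match, so more than one block could survive. Apart from that, .*? and .* behave differently, though probably not in ways that matter in practice.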

zdim
  • I left out `^` (and thus `/m`) as it takes a great conspiracy to have another `total` inside a line _and_ the same pattern between pairs of them; it's kinda a little impossible -- or, it doesn't make sense. However, the `^` _is_ informative there and it doesn't hurt adding it. – zdim Mar 01 '19 at 18:59
1

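The problem is that the substitution s/(^total.*)\1//gms deletes both the captured text and its repeat, so an even number of blocks vanishes completely. You can fix this by putting the repeat in a lookahead, so that only the captured part is deleted each time: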

perl -0777 -pe 's/(^total.*)(?=\1)//gms' log_repeat.txt
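
A small sketch comparing the original substitution with the lookahead version on an even number of blocks. The function names `make_block`, `dedup_pairs` and `dedup_single` are hypothetical and only serve this illustration; both Perl commands read from standard input here.

```shell
#!/bin/sh
# Hypothetical helper: emit the sample block $1 times (Perl's x operator), then a blank line.
make_block() {
    perl -e 'print "total bytes = 0, at time = 1190554\ntime window = 0, at time = 1190554\nBW in Mbps = 0, at time = 1190554\n" x $ARGV[0], "\n"' "$1"
}

# Original attempt from the question: the capture and its repeat are both consumed.
dedup_pairs() {
    perl -0777 -pe 's/(^total.*)\1//gms'
}

# Fixed version: the repeat is only looked at, never consumed.
dedup_single() {
    perl -0777 -pe 's/(^total.*)(?=\1)//gms'
}

echo "== original, 4 copies (only the blank line is left) =="
make_block 4 | dedup_pairs
echo "== lookahead, 4 copies =="
make_block 4 | dedup_single
echo "== lookahead, 5 copies =="
make_block 5 | dedup_single
```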
Håkon Hægland