Delete repeated text between delimiters

Question

I have a data file for fortune that contains many repeated fortunes. I would like to remove them.

Fortunes are delimited by % characters, so a sample fortune file may look like this:

%
This is sample fortune 1
%
This is 
sample fortune 2
%
This fortune 
is repeated
%
This is sample fortune 3
%
This fortune 
is repeated
%
This fortune
is unique
%

As you can see, fortunes can span across multiple lines, rendering the solutions here useless.

What can I do to find and remove the repeated fortunes? I thought about just finding a way to make awk ignore lines beginning with %, but some fortunes share identical lines but are not the same overall (such as the last two in my example), so that is not enough.

I've been trying to solve this with awk so far, but any tool is fine.

see also answer https://stackoverflow.com/questions/33744733/fortune-with-m-exclude-string — joeljpa, Jul 24 '23 at 09:22

hek2mgl · Accepted Answer · 2015-11-03T20:56:58.800

4

That's a job for awk:

awk 'seen[$0]{next}{seen[$0]=1}1' RS='%' ORS='%' fortune

RS='%' means we are using % as the record separator.

seen[$0] checks whether we have already seen this record. $0 is the whole record, i.e. the fortune's text, as a string. If we have seen it, we move on to the next record without printing anything.

{seen[$0]=1} adds the record to the lookup table. The trailing 1 is a pattern that is always true, so it prints the current record. Note that this code only runs when the record has not been seen before, because of the preceding next statement.

ORS='%' sets the output record separator to %.
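
As a rough sketch of the same idea, the two rules are often folded into the single pattern `!seen[$0]++`, which is true only the first time a record appears. The file name `sample_fortunes` and its contents are made up for illustration. One caveat, assuming the file ends with a newline after the last %: with RS='%' that newline forms one more record, so the sketch skips it and writes the closing newline itself. The guard would also drop a completely empty fortune in the middle of the file.

```shell
# Hypothetical sample file; the name and contents are only for illustration.
printf '%%\nA\n%%\nB\nC\n%%\nA\n%%\n' > sample_fortunes

deduped=$(awk '
    $0 == "\n" { next }      # skip the lone newline left after the final %
    !seen[$0]++ { print }    # true only the first time a record appears
    END { printf "\n" }      # finish the output with a newline
' RS='%' ORS='%' sample_fortunes)

printf '%s\n' "$deduped"
```

The second copy of the fortune `A` is dropped, and the output ends with a single % line.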

edited Nov 03 '15 at 20:56

answered Nov 03 '15 at 20:46

I was not aware of the `RS` and `ORS` variables. That was the key thing I was missing. – SnoringFrog Nov 03 '15 at 20:57

Indeed, they are very powerful! I suggest always thinking in terms of *records* instead of *lines* – hek2mgl Nov 03 '15 at 21:13

score 4 · Answer 2 · answered Nov 03 '15 at 20:46

Awk can handle it. Set the record separator to "%\n" and then print unique entries:

$ awk 'BEGIN{RS="%\n"} { if (! ($0 in fortunes)) { fortunes[$0]++; print $0 "%"} }' data
%
This is sample fortune 1
%
This is 
sample fortune 2
%
This fortune 
is repeated
%
This is sample fortune 3
%
This fortune
is unique
%
$
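
A multi-character RS such as "%\n" works in gawk and mawk, but POSIX leaves the behaviour unspecified when RS is longer than one character. As a hedged alternative, the sketch below keeps the default line-based records and collects lines into a buffer until a line consisting only of % is reached. The file name `sample_fortunes` and the variable names `buf` and `seen` are illustrative assumptions, and any text after the last % line is silently dropped.

```shell
# Hypothetical sample file; the name and contents are only for illustration.
printf '%%\nA\n%%\nB\nC\n%%\nA\n%%\n' > sample_fortunes

portable=$(awk '
    $0 == "%" {                   # delimiter line: the buffered fortune is complete
        if (!(buf in seen)) {     # first time this exact text appears
            seen[buf]             # referencing the element records it
            printf "%s%%\n", buf  # emit the fortune followed by its delimiter
        }
        buf = ""
        next
    }
    { buf = buf $0 "\n" }         # accumulate the lines of the current fortune
' sample_fortunes)

printf '%s\n' "$portable"
```

Because the comparison is on the exact buffered text, two fortunes that differ only in trailing spaces are treated as different.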
