
This question has a great answer saying you can use awk '!seen[$0]++' file.txt to delete non-consecutive duplicate lines from a file. How can I delete non-consecutive duplicate lines from a file only if they match a pattern? e.g. only if they contain the string "#####"

Example input

deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'

Desired output

deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'
IceCreamToucan
  • Please add sample input and your desired output for that sample input to your question. – Cyrus Mar 03 '19 at 19:03
  • And what have you tried? – ctac_ Mar 03 '19 at 19:04
  • Make sure every question you post makes sense stand-alone and that the code you post provides a [mcve] for this specific question. Linking to code in some other answer that probably does more than what your question is about isn't the best way to get people to help you. – Ed Morton Mar 03 '19 at 19:10
  • If you produce your file with grep and then sed, adding an awk at the end is not the best way. It can all be done with awk. – ctac_ Mar 03 '19 at 19:29

4 Answers


You may use

awk '!/#####/ || !seen[$0]++'

Or, as Ed Morton suggests, the equivalent

awk '!(/#####/ && seen[$0]++)'

Here, !seen[$0]++ works as usual: it is true only for the first occurrence of a line. The !/#####/ part is true for lines that do not contain #####. Combined with ||, the condition prints a line if it does not contain ##### or if it is the first occurrence of that line; in other words, only duplicate lines containing ##### are removed.
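The same logic can be spelled out longhand, which may be easier to read (this is a sketch assuming any POSIX awk; file.txt is a placeholder name):

```shell
awk '
  !/#####/    { print; next }  # lines without ##### always pass through
  !seen[$0]++ { print }        # ##### lines print only the first time seen
' file.txt
```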

See an online awk demo:

s="deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223  #####:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'"
awk '!/#####/ || !seen[$0]++' <<< "$s"

Output:

deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223  #####:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'
Wiktor Stribiżew

Try this Perl command-line regex solution, which uses slurp mode (-0777) to read the whole file at once.

perl -0777 -ne ' $z=$y=$_; 
                 while( $y ne $x) 
                 { $z=~s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse ; $x=$y ;$y=$z } ; 
                 print "$z" '

with the given inputs

$ cat toucan.txt
deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'

$ perl -0777 -ne ' $z=$y=$_; while( $y ne $x) { $z=~s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse ; $x=$y ;$y=$z } ; print "$z" ' toucan.txt
deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'

stack0114106

Whenever I think about matching patterns and selective printing, I think of the Practical Extraction and Report Language: Perl! Here's a Perl one-liner that does what you're asking. You should be able to copy-paste it into a shell and have it work:

perl -wnle 'BEGIN { $rows_with_five_hashes = {}; } $thisrow = $_; if ($thisrow =~ /[#]{5}/) { if (!exists $rows_with_five_hashes->{$thisrow}) { print; } $rows_with_five_hashes->{$thisrow}++; } else { print; }' input.txt

Here's the same Perl with line breaks and comments for clarity (note: this isn't executable as-is):

BEGIN {
  # create a counter for rows that match the pattern
  $rows_with_five_hashes = {}; 
} 
# capture the row from the input file
$thisrow = $_;
if ($thisrow =~ /[#]{5}/) { 
  if (!exists $rows_with_five_hashes->{$thisrow}) { 
    # this row matches the pattern and we haven't seen it before
    print; 
  } 
  # Increment the counter for rows that match the pattern.
  # Do this AFTER we print, or else our "exists" print logic fails.
  $rows_with_five_hashes->{$thisrow}++;
} 
else { 
  # print all rows that don't match the pattern
  print;
}

Ruby has similar "one-liner" functionality for running code directly on the command line (much of which it borrowed from Perl).

For more info on the wnle command-line switches, see perlrun in the Perl docs. If you have many files you want to modify in place, keeping backup copies of the originals, with a single Perl command, check out the -i switch in those same docs.

If you found yourself running this all the time and wanted to keep a handy executable script, you could adapt this pretty easily to run on just about any system that has a Perl interpreter.

Jamin Kortegard

This might work for you (GNU sed):

sed '/#$/{G;/^\(\S*\s\).*\1/!P;h;d}' file

All lines other than those of interest (lines ending in #) are printed as normal.

For each line of interest, append the previously seen such lines (the hold space) to the current line and, using pattern matching, print the current line only if its leading filename has not been encountered before. Then store the pattern space back in the hold space, ready for the next match, and delete the pattern space.
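A minimal demo of the idea, assuming GNU sed (\S and \s are GNU extensions; note the dedup key is the first field of each #-terminated line, not the whole line):

```shell
printf 'a.txt ###\n1\na.txt ###\n2\n' |
  sed '/#$/{G;/^\(\S*\s\).*\1/!P;h;d}'
# a.txt ###
# 1
# 2
```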

potong