
Using sed and/or awk, I'd like to be able to delete a line only if it contains the string "foo" AND the lines before and after contain the strings "bar" and "baz" respectively.

So for this input:

blah
blah
foo
blah
bar
foo
baz
blah

we would delete the second foo but nothing else, leaving:

blah
blah
foo
blah
bar
baz
blah

I've tried using a while loop to read the file line by line, but this is slow and I can't work out how to match the previous and next lines.

Edit - as requested in a comment, this is the current state of my while loop. Currently only matches the previous line (stored from the previous loop as $linepre).

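linepre=""
# Only the previous line is checked here; the following line is not looked at yet.
while IFS= read -r line
do
   # Skip a "foo" line only when the line before it was "bar".
   if ! { [ "$line" = foo ] && [ "$linepre" = bar ]; }
   then
       printf '%s\n' "$line"
   fi
   linepre=$line
done < foobarbaz.txt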

Pretty ugly.

birac

5 Answers


For an elegant perl solution see Sundeep's answer.

For a similar and very nice sed solution, see potong's second answer.

Both solutions read the whole file into memory and process it in one go. This is fine as long as you don't need to process gigabyte-sized files. In other words, these are the best solutions (if we ignore CASE3).

comment: both solutions fail CASE3 (see below). CASE3 is an exceptional, debatable case.


Update 1: the following awk solution is a new script which works in all cases. Earlier solutions, for which this answer was accepted, failed on particular cases. The presented solution also handles the nested grouping (CASE3 below):

awk 'BEGIN{p=1;l1=l2=""}
     (NR>2) && p {print l1}
     { p=!(l1~/bar/&&l2~/foo/&&/baz/);
       l1=l2;l2=$0
     }
     END{if (NR>1 && p) print l1   # NR tests keep empty lines at the end
         if (NR>0     ) print l2}' <file>

To solve the problem, we keep a buffer of three lines: l1, l2 and $0. Each time a new line is read, we decide whether l2 (which becomes l1 after the shift) should be printed in the next cycle, and then shift the buffered lines. Printing starts only from NR=3 onward. A line is suppressed when l1 contains bar, l2 contains foo and $0 contains baz; in that case p is set to 0 and the foo line is not printed in the next cycle.
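
As a rough illustration of the buffering described above, the same idea can be wrapped in a shell function and run on the sample from the question and on the nested CASE3 input. The function name drop_foo and the variable names keep, prev2 and prev1 are made up for this sketch, and the NR-based tests in the END block are a choice made for the sketch.

```shell
# Hypothetical helper: drops a line containing "foo" when the previous line
# contains "bar" and the next line contains "baz".
drop_foo() {
    awk '
        # Print the line from two records back unless it was flagged for removal.
        NR > 2 && keep { print prev2 }
        {
            # prev1 is the middle line of the window; $0 is its successor.
            keep = !(prev2 ~ /bar/ && prev1 ~ /foo/ && /baz/)
            prev2 = prev1
            prev1 = $0
        }
        END {
            # Flush the last two buffered lines.
            if (NR > 1 && keep) print prev2
            if (NR > 0) print prev1
        }
    ' "$@"
}

# Sample input from the question, then the nested CASE3 input.
printf '%s\n' blah blah foo blah bar foo baz blah | drop_foo
printf '%s\n' bar foobar foobaz baz | drop_foo
```

On the question's sample this prints every line except the foo between bar and baz; on the nested input both foobar and foobaz are dropped, leaving bar and baz.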

Update 2: A sed solution based on the same principle is also possible. sed has two buffers: the pattern space, where all operations take place, and the hold space, which serves as long-term memory. The idea is to put the word print in the hold space as a flag, but we can only do this by swapping the two spaces (using x).

 sed '1{x;s/^.*$/print/;x;N};                           #1
      N;                                                #2
      x;/print/{z;x;P;x};x;                             #3
      /bar.*\n.*foo.*\n.*baz/!{x;s/^.*$/print/;x};      #4
      $s/\(bar.*\)\n.*foo.*\n\(.*baz\)/\1\n\2/;         #5
      D' <file>                                         #6
  • line #1 initializes the state by placing the word print in the hold space (x;s...;x) and appends another line to the pattern space (N)
  • line #2 adds the third line to the pattern space
  • line #3 checks the hold space to determine whether the first line of the pattern space must be printed, and clears the hold space. P prints up to the first \n in the pattern space and z (a GNU extension) empties the pattern space
  • line #4 determines whether we should print in the next cycle: it checks if the real pattern matches and, if not, puts the word print in the hold space
  • line #5 is the end-of-file condition
  • line #6 deletes up to the first \n in the pattern space and goes back to #1 without reading a new line.

At exit, the pattern-space is printed again.
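
To get a feel for the N and D mechanics behind the three-line window, the stripped-down sketch below only builds the window and dumps it with l on every cycle; it does no deletion at all. It assumes GNU sed, and the function name show_window and the sample input are made up for illustration.

```shell
# Hypothetical helper: shows the three-line window that N and D maintain.
# Assumes GNU sed.
show_window() {
    sed -n '
        # First cycle only: pull in a second line.
        1{N}
        # Every cycle except at end of input: append one more line.
        $!N
        # Dump the pattern space unambiguously (newline shown as \n, end as $).
        l
        # Drop the first line and restart the cycle without reading input.
        D
    ' "$@"
}

printf '%s\n' a b c d | show_window
```

With the four input lines a to d, the window slides from a\nb\nc$ to b\nc\nd$ and then shrinks to c\nd$ and d$ once the input is exhausted.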

comment: if you want to see what the pattern space and hold space look like, you can add the following code after each line: s/^/P:/;l;s/^P://;x;s/^/H:/;l;s/^H://;x. This prints both spaces, prefixed with P: and H: respectively.

Used test file:

# bar-foo-baz test file
# An asterisk indicates the foo
# lines that should be removed
<CASE0 :: default case>
bar
foo (*)
baz
<CASE1 :: reset cycle on second line>
bar
foobar
foo (*)
baz
<CASE2 :: start cycle at end of previous cycle>
bar
foo (*)
bazbar
foo (*)
baz
<CASE3 :: nested cases>
bar
foobar (*)
foobaz (*)
baz
<CASE4 :: end-of-file case>
bar
foo

Formerly accepted answer: (updated to indicate which cases fail)

awk: fails CASE3

awk '!/baz/&&(c==2){print foo}
     /bar/         {c=1;print;next}
     /foo/ &&(c==1){c++;foo=$0;next}
                   {c=0;print}
     END{if(c==2){print foo}}' <file>

This solution prints all lines by default, except if the line contains foo which comes after a line containing bar. The logic above just decides if we should print the line foo or not.

  • !/baz/&&(c==2){print foo} : this handles early termination. If no baz is found after a valid bar-foo combination, it prints the foo line.

  • /bar/{c=1;print;next} : this initialises the start of a new cycle. If bar is found, set c to 1, print the line and move to the next line. bar lines are always printed. This line resolves CASE1 and CASE2.

  • /foo/&&(c==1){c++;foo=$0;next} : this checks for the bar-foo combination. It stores the foo line and moves to the next line.

  • {c=0;print} : if we reach this point, we found neither a bar line nor a bar-foo combination. Just print the line by default and reset the counter to zero.

  • END{if(c==2){print foo}} : this statement just solves CASE4.

gawk: fails CASE3

awk 'BEGIN{ORS="";RS="bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz"}
     {sub(/\n[^\n]*foo[^\n]*\n/,"\n",RT); print $0 RT}' <file>

The RS is set to bar[^\n]*\n[^\n]*foo[^\n]*\n[^\n]*baz, i.e. the pattern we are interested in. Here, [^\n]*\n[^\n]* represents a string containing a single \n, thus the RS represents a valid bar-foo-baz combination. The found record separator RT is edited with sub to remove the foo line and printed after the found record.

RT (gawk extension) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.

sed: fails CASE1, CASE2, CASE3, CASE4

sed -n '/bar/{N;/\n.*foo/{N;/foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/}}};p' <file>
  • /bar/{N;...} if the line contains bar, append the next line to the pattern buffer (N)
  • /\n.*foo/{N;...} if the pattern buffer has foo after a newline character, append the next line to the pattern buffer (N)
  • /foo.*\n.*baz[^\n]*$/{s/\n.*foo.*\n/\n/} if the pattern buffer contains foo followed by a single newline and ends with a line containing baz, remove the line containing foo. The search pattern here excludes cases such as barfoo\nfoobaz\ncar
kvantour

Modified sample for more exotic cases:

$ cat ip.txt 
blah
bar
blah
foo
blah
bar
foo
baz
blah
bar
foobar
foo
baz
asdf

If perl is okay and the input file is small enough to fit in memory:

$ perl -0777 -pe 's/bar.*\n\K.*foo.*\n(?=.*baz)//g' ip.txt
blah
bar
blah
foo
blah
bar
baz
blah
bar
foobar
baz
asdf
  • -0777 to slurp entire input file
  • bar.*\n\K check if previous line contains bar
  • .*foo.*\n current line contains foo
  • (?=.*baz) next line contains baz
  • See the lookarounds section in Reference - What does this regex mean? for more details on this regex. Here they ensure that overlapping matches across 3 lines are taken care of
Sundeep
  • `perl` for the win! very nice! – kvantour Mar 05 '18 at 15:46
  • Wow - the Perl solution is so elegant! I'm definitely going to try and use Perl for these kind of things in the future. – birac Mar 05 '18 at 16:21
  • @Sundeep, there is a nested situation `bar \n foobar \n foobaz \n baz` which seems to fail. I would expect the two `foo` lines to be removed. It is however an exceptional case. – kvantour Mar 07 '18 at 19:02
  • @kvantour yeah, because `bar \n foobar \n foobaz` would result in regex engine moving to `foobaz` after removing `foobar` line... so, it won't be able to match the `bar` that was already removed.. not sure if regex can handle such cases :) – Sundeep Mar 08 '18 at 03:18

This might work for you (GNU sed):

sed ':a;/bar/!b;n;/foo/!ba;N;s/^.*\n\(.*baz\)/\1/;t;P;D' file

If the current line does not contain bar, print it and begin a new cycle. Otherwise, print the line containing bar and read the next line into the pattern space. If that line does not contain foo, go back and check whether it contains bar. Otherwise, append the next line to the current line (containing foo) and check if the appended line contains baz. If it does, remove the first line (containing foo), print the line containing baz and begin a new cycle. Otherwise, the appended line does not contain baz, so print the line containing foo, delete it, and then check whether the appended line contains bar.
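
As a quick check of the walk-through above, the command can be run on the sample from the question. This sketch assumes GNU sed; the function name sed_drop_foo is made up for illustration.

```shell
# Hypothetical wrapper around the sed command from this answer (GNU sed assumed).
sed_drop_foo() {
    sed ':a;/bar/!b;n;/foo/!ba;N;s/^.*\n\(.*baz\)/\1/;t;P;D' "$@"
}

# Sample input from the question: only the foo between bar and baz should go.
printf '%s\n' blah blah foo blah bar foo baz blah | sed_drop_foo
```

On this input the first foo is kept, because the line before it does not contain bar, and the second foo is dropped.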

An alternative, slurping the whole file into memory:

sed -zr 's/(bar[^\n]*)\n[^\n]*foo[^\n]*(\n[^\n]*baz)/\1\2/g' file
potong

Solution 1: For exactly the input you have shown, without any further conditions, the following may help.

awk '/^bar/ && getline var ~ /^foo/ && getline var1 ~ /^baz/{print "bar" ORS "baz";next} 1'  Input_file

Solution 2: The following awk may also help.

awk '/bar/{val=FNR} /^foo/ && ++val==FNR{value=$0;getline;if($0 ~ /^baz/){print value ORS $0;val="";next} else {print value}} 1'    Input_file

Checks of different combinations against the second solution above:

Situation 1: When the strings bar, foo and baz come in sequence, it works fine.

Situation 2: When bar is followed by baz without foo in between, it should also work.

RavinderSingh13

First variant - using sed

sed -r ':l; N; $!bl; s/(^|\nbar\n)foo\n(baz$|\n)/\1\2/g' input.txt

or, the same, but shorter and probably faster, by using the -z option:

sed -zr 's/(^|\nbar\n)foo\n(baz\n|$)/\1\2/g' input.txt

-z = separate lines by NUL characters instead of newlines. This option can be used to read the whole text into memory at once (provided the text contains no NUL characters).
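
The effect of -z can be seen in a tiny sketch. Without it, s/\n/,/g has nothing to match, because each line arrives in the pattern space without its newline; with -z the whole input is a single record and the newlines become visible to the regex. This assumes GNU sed (-z is a GNU extension), and the sample input is made up.

```shell
# Line mode: no newline inside the pattern space, so nothing is replaced.
printf 'a\nb\n' | sed 's/\n/,/g'

# NUL-separated mode (GNU sed): the whole input is a single record.
printf 'a\nb\n' | sed -z 's/\n/,/g'
```

The first command leaves the two lines untouched, while the second joins them into a,b, on one line.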

Second variant - using grep and sed

grep --color=always -Pz '\^|\nbar\n\Kfoo\n(?=baz\n)' input.txt | sed '/31m/d'

Both variants read the entire text into memory before processing, so they are not optimal for large files.

Input

blah
blah
foo
blah
bar
foo
baz
blah

Output

blah
blah
foo
blah
bar
baz
blah
MiniMax