Delete lines by multiple patterns in specific range of lines

Question

I have the following (simplified) file:

<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>

What I am trying to achieve is to delete all ROW elements that match on the title, but do not match on the latest VERSION (in this case 1,3). So, what I have in mind is something like the following with sed:

sed -i '/<ROW>/,/<\/ROW>/<COLUMN NAME=\"TITLE\">title 1.*<COLUMN NAME=\"VERSION\">^1,3<\/COLUMN>/d' file

The expected output should be the following:

<RESULTS>
<ROW>
  <COLUMN NAME="TITLE">title 1</COLUMN>
  <COLUMN NAME="VERSION">1,3</COLUMN>
</ROW>
</RESULTS>

Unfortunately, this did not work, and neither did anything else I tried. I searched a lot for similar issues, but nothing worked for me. Is there a way to achieve this with any Linux command-line utility (sed, awk, etc.)?

Thanks a lot in advance.

Is using a proper xml tool an option? https://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell — rethab, Sep 27 '20 at 14:43

score 2 · Answer 1 · answered Sep 27 '20 at 14:47

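/<ROW>/,/<\/ROW>/ followed by another pattern won't work. A range address only selects which lines a command applies to, and sed expects a command right after it, not a second pattern. Within the range, sed still processes one line at a time, so a single regex can never see the TITLE line and the VERSION line together.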

You'll have to use one of the advanced features of sed. The simplest is probably the hold space.

This:

sed -n '/<ROW>/{h;d;};H;'

will store an entire ROW block in the hold space, and overwrite it when it encounters a new ROW block. (And print nothing.)

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;p;}'

will store the entire ROW block, then print it out when it is complete.

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;p;}'

will do the same, but will delete a block that does not contain "title 1".

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}'

will do the same, but print only if the block contains "1,3". (You can spell out the matching lines more explicitly; I'm trying to keep this code concise.)
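To try the final command end to end, here is a small self-contained sketch. It assumes the question's sample data written to a temporary file; the variable names `tmp` and `kept` are only for illustration. Because of `-n`, only the matching ROW block is printed, so the enclosing RESULTS lines do not appear in the output.

```shell
# Write the sample input from the question to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>
EOF

# h starts a fresh block in the hold space at <ROW>, H appends each later line,
# and at </ROW> the block is copied back (g), filtered, and printed (p).
kept=$(sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}' "$tmp")
printf '%s\n' "$kept"

rm -f "$tmp"
```

Blocks with a different title are dropped by `/title 1/!d` as well, so this form prints only the "title 1" row with version 1,3 and nothing else.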

Could you please explain more about `{h;d;};H;` and `{g;/title 1/!d;p;}'`? — Thân LƯƠNG Đình, Sep 27 '20 at 15:08
@ThanLUONG: `/<ROW>/{h;d;}` means *"if it contains <ROW>, put this line in the hold space (overwriting whatever was there), then delete the line (and begin again with the next line)".* `/title 1/!d` means *"if it doesn't contain 'title 1', delete it (and begin again with the next line)".* — Beta, Sep 27 '20 at 15:22

score 2 · Answer 2 · answered Sep 28 '20 at 00:10

This might work for you (GNU sed):

sed '/<ROW>/{:a;N;/<\/ROW>/!ba;/TITLE.*title 1/!b;/VERSION.*1,3/b;d}' file

Gather up lines between <ROW> and </ROW>.

If the lines collected don't contain the correct title, bail out.

If the lines collected do contain the correct version, bail out.

Otherwise delete the lines collected.
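The same one-liner can be spread over several lines with comments, which may make the four steps easier to follow. This is only a sketch for GNU sed. The extra "title 2" row is not in the question; it is added here to show that rows with other titles are left untouched. The variable names `tmp` and `filtered` are illustrative.

```shell
# Sample input: the question's data plus one hypothetical "title 2" row.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 2</COLUMN>
    <COLUMN NAME="VERSION">2,0</COLUMN>
  </ROW>
</RESULTS>
EOF

filtered=$(sed '/<ROW>/{
  # Gather lines until the closing tag is in the pattern space.
  :a
  N
  /<\/ROW>/!ba
  # A different title: leave the block alone (branch to end, auto-print).
  /TITLE.*title 1/!b
  # The wanted version: leave the block alone as well.
  /VERSION.*1,3/b
  # Same title, any other version: drop the whole block.
  d
}' "$tmp")
printf '%s\n' "$filtered"

rm -f "$tmp"
```

Since this version runs without `-n`, lines outside any ROW block (such as the RESULTS wrapper) pass through unchanged.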
