I have the following (simplified) file:

 <RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>

What I am trying to achieve is to delete all ROW elements that match on the title, but do not match on the latest VERSION (in this case 1,3). So, what I have in mind is something like the following with sed:

sed -i '/<ROW>/,/<\/ROW>/<COLUMN NAME=\"TITLE\">title 1.*<COLUMN NAME=\"VERSION\">^1,3<\/COLUMN>/d' file

The expected output should be the following:

<RESULTS>
<ROW>
  <COLUMN NAME="TITLE">title 1</COLUMN>
  <COLUMN NAME="VERSION">1,3</COLUMN>
</ROW>
</RESULTS>

Unfortunately, this did not work, and neither did anything else I tried. I searched a lot for similar issues, but nothing worked for me. Is there a way to achieve this with any Linux command-line utility (sed, awk, etc.)?

Thanks a lot in advance.

Wiktor Stribiżew
Vagelis Prokopiou

2 Answers

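/<ROW>/,/<\/ROW>/ on its own does select each ROW block, but it won't do what you want here: sed still processes the lines in that range one at a time, so a regex can never match a TITLE line and a VERSION line together, and a range address has to be followed by a command, not by another regex.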

You'll have to use one of the advanced features of sed. The simplest is probably the hold space.

This:

sed -n '/<ROW>/{h;d;};H;'

will store an entire ROW block in the hold space, and overwrite it when it encounters a new ROW block. (And print nothing.)

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;p;}'

will store the entire ROW block, then print it out when it is complete.
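For anyone who finds the one-liner hard to read, here is the same buffering step spelled out as a multi-line script with comments. This is only a sketch: the two-row input is a shortened copy of the sample from the question, and the names tmp and out are mine.

```shell
# Shortened copy of the sample input, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
</RESULTS>
EOF

out=$(sed -n '
# A line containing <ROW> starts a new block: copy it into the hold
# space, replacing the previous block, then start over with the next line.
/<ROW>/{
  h
  d
}
# Every other line is appended to the hold space.
H
# At </ROW> the block is complete: copy it back and print it.
/<\/ROW>/{
  g
  p
}' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Both ROW blocks come out unchanged. The RESULTS lines do not appear, because -n suppresses everything that is not printed explicitly.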

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;p;}'

will do the same, but will delete a block that does not contain "title 1".

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}'

will do the same, but print only if the block contains "1,3". (You can spell out the matching lines more explicitly; I'm trying to keep this code concise.)
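A quick way to try the last command is to run it against the sample from the question. This is a sketch under my own assumptions (a temporary file for the input, a variable named kept for the result); none of that is required by sed.

```shell
# Recreate the sample input from the question in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>
EOF

# Only the block that contains both "title 1" and "1,3" is printed.
kept=$(sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}' "$tmp")
printf '%s\n' "$kept"
rm -f "$tmp"
```

Note that this prints only the surviving ROW block. The RESULTS opening and closing lines are swallowed by -n, so they have to be put back separately if you need the wrapper in the output.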

Beta
  • Could you please explain more about `{h;d;};H;` and `{g;/title 1/!d;p;}`? – Thân LƯƠNG Đình Sep 27 '20 at 15:08
  • @ThanLUONG: `/<ROW>/{h;d;}` means *"if it contains ROW, put this line in the hold space (overwriting whatever was there), then delete the line (and begin again with the next line)".* `/title 1/!d` means *"if it doesn't contain 'title 1', delete it (and begin again with the next line)".* – Beta Sep 27 '20 at 15:22

This might work for you (GNU sed):

sed '/<ROW>/{:a;N;/<\/ROW>/!ba;/TITLE.*title 1/!b;/VERSION.*1,3/b;d}' file

Gather up lines between <ROW> and </ROW>.

If the lines collected don't contain the correct title, bail out.

If the lines collected do contain the correct version, bail out.

Otherwise delete the lines collected.
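The same script can be laid out one command per line, with the steps above as comments. This is a sketch assuming GNU sed, as in the one-liner; the temporary file and the variable names tmp and out are only there to make it runnable.

```shell
# Recreate the sample input from the question in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>
EOF

out=$(sed '
/<ROW>/{
  # Append lines until the pattern space holds a complete block.
  :a
  N
  /<\/ROW>/!ba
  # Different title: branch to the end, so the block is printed as is.
  /TITLE.*title 1/!b
  # Matching title and the wanted version: keep the block too.
  /VERSION.*1,3/b
  # Matching title, any other version: drop the block.
  d
}' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Because this runs without -n, lines outside a ROW block are printed automatically, so the RESULTS wrapper survives and only the 1,1 and 1,2 blocks disappear.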

potong