Here is the file (named sample.xml):


<?xml version="1.0" encoding="UTF-8"?>
<configs>

    <blah1 value="ma">
      <tag3>100MB</tag3>
    </blah1>

    <blah1 value="ba">
      <tag3>20MB</tag3>
    </blah1>

     <blah2 value="*" version="1.0" result="true">
        <blah1 value="xyz">
          <blah1 value="uvw" result="true">
             <tag>4</tag>
          </blah1>
        </blah1>
     </blah2>

  <!-- This is tag with def value -->
  <blah2 value="*" version="2.0" result="true">
    <blah1 value="abc">
      <blah1 value="def" result="true">
        <tag2>on</tag2>
      </blah1>
    </blah1>
  </blah2>

</configs>

When a block contains the string value="def", remove the entire block, from the opening <blah2> tag to the closing </blah2> tag.

I am not familiar with sed's hold space, but I found something via Google that comes very close:

sed -n '/<blah2.*>/,/<\/blah2>/{
                                  H
                                  /<\/blah2>/ { 
                                        s/.*//;x
                                       /def/d
                                       p 
                                  }
                               }' sample.xml

Expected result:


<?xml version="1.0" encoding="UTF-8"?>
<configs>

    <blah1 value="ma">
      <tag3>100MB</tag3>
    </blah1>

    <blah1 value="ba">
      <tag3>20MB</tag3>
    </blah1>

     <blah2 value="*" version="1.0" result="true">
        <blah1 value="xyz">
          <blah1 value="uvw" result="true">
             <tag>4</tag>
          </blah1>
        </blah1>
     </blah2>

</configs>

Actual result (with above non-working sed):

     <blah2 value="*" version="1.0" result="true">
        <blah1 value="xyz">
          <blah1 value="uvw" result="true">
             <tag>4</tag>
          </blah1>
        </blah1>
     </blah2>
Cyrus
satya
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jun 24 '19 at 06:33
  • Is `def` always an attribute of tag `/configs/blah2[2]/blah1/blah1`? – Cyrus Jun 24 '19 at 06:39
  • Right, but xmlstarlet is currently not available on that host (which runs some proprietary Linux), and the host has no internet access. I will have to download the binary and move it there. – satya Jun 24 '19 at 06:42
  • Yes, `def` is always an attribute of the second blah1. – satya Jun 24 '19 at 06:43
  • Would an answer with xmlstarlet help you? – Cyrus Jun 24 '19 at 06:46
  • I would like to know the plain sed answer first; an xmlstarlet answer is a bonus. BTW, to keep it simple, you can assume `def` can be in either of the inner blah1 tags. Can we make the hold space and pattern space work for this scenario? – satya Jun 24 '19 at 07:14

3 Answers

3

Delete the second blah2 element with xmlstarlet:

xmlstarlet edit --delete '//configs[blah2[2]/blah1/blah1[@value="def"]]/blah2[2]' file.xml

Output:

<?xml version="1.0" encoding="UTF-8"?>
<configs>
  <blah1 value="ma">
    <tag3>100MB</tag3>
  </blah1>
  <blah1 value="ba">
    <tag3>20MB</tag3>
  </blah1>
  <blah2 value="*" version="1.0" result="true">
    <blah1 value="xyz">
      <blah1 value="uvw" result="true">
        <tag>4</tag>
      </blah1>
    </blah1>
  </blah2>
</configs>

To edit the file in place, add the option -L.


Explanation of the used XPath:

//configs[blah2[2]/blah1/blah1[@value="def"]]/blah2[2]
|---A---| |-------------B------------------| |---C---|

A and B: path to the attribute you are looking for

A and C: path to the tag to be deleted

Cyrus
  • but this one doesn't check for tag with value "def". The number of tag sections can be more or less. – satya Jun 24 '19 at 07:29
  • Great! one question. can xml comment above this tag section be deleted using xmlstarlet? – satya Jun 24 '19 at 08:22
  • Add `--delete '//configs/comment()'` to the `xmlstarlet edit` command to delete all comments in tag `configs`. – Cyrus Jun 24 '19 at 08:26
  • Thank you @cyrus. I am just choosing the sed answer as accepted solution because of the current restrictions I have. Ideally I would like to choose both answers as accepted answers – satya Jun 24 '19 at 10:18
  • @satya - Here's another (untested) xpath option: `//blah2[.//blah1/@value='def']` – Daniel Haley Jun 25 '19 at 17:58
1

This might work for you (GNU sed):

sed '/<blah2.*>/{:a;N;/<\/blah2.*>/!ba;/value="def"/d}' file

If a line matches <blah2.*>, append the following lines until one matches <\/blah2.*>. Then test the gathered lines for the string value="def" and, if it is found, delete all of them.
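
The same loop can be written one command per line, which may make it easier to follow. This is only a sketch: the `sample` temp file is a trimmed stand-in for sample.xml, `result` is an illustrative variable name, and it assumes the opening and closing blah2 tags sit on separate lines and that blah2 blocks are not nested.

```shell
# Trimmed stand-in for sample.xml from the question (an assumption for this demo).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configs>

  <blah2 value="*" version="1.0" result="true">
    <blah1 value="xyz">
      <blah1 value="uvw" result="true">
        <tag>4</tag>
      </blah1>
    </blah1>
  </blah2>

  <!-- This is tag with def value -->
  <blah2 value="*" version="2.0" result="true">
    <blah1 value="abc">
      <blah1 value="def" result="true">
        <tag2>on</tag2>
      </blah1>
    </blah1>
  </blah2>

</configs>
EOF

result=$(sed '
# A line that opens a blah2 element starts the gathering loop.
/<blah2.*>/ {
  :a
  # Append the next input line to the pattern space.
  N
  # No closing tag gathered yet: jump back to label a.
  /<\/blah2.*>/!ba
  # The whole block is now in the pattern space. Delete it if it holds value="def".
  /value="def"/d
}
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Lines outside a blah2 block never enter the loop, so sed prints them unchanged. The comment line above the second block is one of those lines, which is why it survives this version.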

potong
  • One problem is we cannot make use of previous tag section values like "ma" and "ba".. They may or may not be present – satya Jun 24 '19 at 07:16
  • @satya: `b` is a `sed` command. `:a` is a label with name `a`. `/<\/blah2.*>/!ba`: If `<\/blah2.*>` was not found in pattern space then jump to label `a`. This is a loop until `<\/blah2.*>` is found. – Cyrus Jun 24 '19 at 07:27
  • oh! my bad. Let me try this – satya Jun 24 '19 at 07:30
  • This works great! One last thing that I forgot to mention that, sometimes there is comment above the tag section. Can that be removed as part of sed? If not I can delete it separately as another sed. – satya Jun 24 '19 at 08:18
  • If the comment line is the line before the `<blah2>` line use: `sed 'N;//{:a;N;/<\/blah2.*>/!ba;/value="def"/d};P;D' file` – potong Jun 24 '19 at 17:47
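
On the question of whether the hold space can be made to work: the attempt in the question runs with `-n` and only prints inside the `<blah2>` range, so every line outside a block is lost. A minimal sketch of a hold-space variant that also prints the other lines could look like this. The `sample` temp file and the `result` variable are illustrative, and the script assumes non-nested blah2 blocks with tags on their own lines.

```shell
# Trimmed stand-in for sample.xml from the question (an assumption for this demo).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configs>

  <blah2 value="*" version="1.0" result="true">
    <blah1 value="xyz">
      <blah1 value="uvw" result="true">
        <tag>4</tag>
      </blah1>
    </blah1>
  </blah2>

  <blah2 value="*" version="2.0" result="true">
    <blah1 value="abc">
      <blah1 value="def" result="true">
        <tag2>on</tag2>
      </blah1>
    </blah1>
  </blah2>

</configs>
EOF

result=$(sed -n '
/<blah2.*>/,/<\/blah2>/ {
  # Opening line: overwrite the hold space so each block starts clean.
  /<blah2.*>/h
  # Every other line of the block: append it to the hold space.
  /<blah2.*>/!H
  # Closing line: fetch the gathered block and print it unless it holds value="def".
  /<\/blah2>/ {
    g
    /value="def"/!p
  }
  # Skip the final p so block lines are not printed one by one.
  b
}
# Lines outside any block are printed as they are.
p
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Using `h` on the opening line instead of `H` avoids the leading empty line that appending to an empty hold space would produce.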
0

Since you're happy with a sed solution, here's a better (clearer, more portable, etc.) awk alternative given your posted sample input/output:

$ awk -v RS= -v ORS='\n\n' '!/value="def"/' file
<?xml version="1.0" encoding="UTF-8"?>
<configs>

    <blah1 value="ma">
      <tag3>100MB</tag3>
    </blah1>

    <blah1 value="ba">
      <tag3>20MB</tag3>
    </blah1>

     <blah2 value="*" version="1.0" result="true">
        <blah1 value="xyz">
          <blah1 value="uvw" result="true">
             <tag>4</tag>
          </blah1>
        </blah1>
     </blah2>

</configs>

If that's not all you need, there's a better awk alternative for whatever it is you do need since sed is only best for doing s/old/new on individual strings.
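
The command above depends on awk's paragraph mode: an empty `RS` makes awk treat each run of lines separated by blank lines as one record. The toy input below is made up purely to show that behaviour, and `with_blank` and `without_blank` are illustrative variable names.

```shell
# Two groups separated by a blank line become two records.
with_blank=$(printf 'a1\na2\n\nb1\n' |
  awk -v RS= '{ printf "record %d: %d line(s)\n", NR, split($0, lines, "\n") }')
printf '%s\n' "$with_blank"

# Without the blank line, everything collapses into a single record.
without_blank=$(printf 'a1\na2\nb1\n' |
  awk -v RS= '{ printf "record %d: %d line(s)\n", NR, split($0, lines, "\n") }')
printf '%s\n' "$without_blank"
```

In the sample file the comment and the second blah2 block form one such record, so the pattern `!/value="def"/` drops them together. It also means the approach relies on blank lines between blocks.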

Ed Morton
  • Is it possible to remove the comment line(s) along with the block? The comment can be a single line before the block or a multi-line one. – satya Jun 24 '19 at 13:33
  • Look at the output I posted - it did remove the comment line (see how `<!-- This is tag with def value -->` is not present in the output?) and it doesn't matter if it's single or multiple lines – Ed Morton Jun 24 '19 at 13:35
  • OK, right. I tried it, and it works if the blocks are separated by a blank line. If I remove the blank line between blah2 blocks, or between the last blah2 block and the closing configs tag, it doesn't work :( – satya Jun 24 '19 at 13:51
  • @satya all we have to go on is the sample input you provide. If your real input doesn't follow the same format as your sample input then you shouldn't expect any solution that doesn't use an XML parser to work robustly. If you can use an XML parser then you should. If you can't then update your example to truly reflect your file's format and you may be able to get a sed or awk answer to handle that specific format. – Ed Morton Jun 24 '19 at 14:47
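
For input without blank lines between blocks, one possible line-oriented awk sketch buffers each blah2 block, plus any comment directly above it, and decides at the closing tag whether to print. It is not an XML parser and rests on several assumptions: opening and closing blah2 tags are on their own lines, blocks are not nested, and the comment immediately precedes the block. The names `sample`, `result`, `buf`, `inblock` and `incomment` are illustrative.

```shell
# Made-up variant of the sample with no blank lines and a two-line comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<configs>
  <blah2 value="*" version="1.0" result="true">
    <blah1 value="xyz">
      <blah1 value="uvw" result="true">
        <tag>4</tag>
      </blah1>
    </blah1>
  </blah2>
  <!-- This is tag
       with def value -->
  <blah2 value="*" version="2.0" result="true">
    <blah1 value="abc">
      <blah1 value="def" result="true">
        <tag2>on</tag2>
      </blah1>
    </blah1>
  </blah2>
</configs>
EOF

result=$(awk '
# Inside a blah2 block: buffer lines until the closing tag, then decide.
inblock {
    buf = buf $0 "\n"
    if ($0 ~ /<\/blah2>/) {
        if (buf !~ /value="def"/) printf "%s", buf
        buf = ""
        inblock = 0
    }
    next
}
# A comment (possibly spanning lines) is held back until we see what follows it.
incomment || /<!--/ {
    buf = buf $0 "\n"
    incomment = ($0 !~ /-->/)
    next
}
# Opening tag: start buffering the block after any held comment.
/<blah2[ >]/ {
    buf = buf $0 "\n"
    inblock = 1
    next
}
# Any other line: release a held comment, then print the line.
{
    printf "%s%s\n", buf, $0
    buf = ""
}
# Flush a comment that was still held at end of input.
END { printf "%s", buf }
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

A comment followed by anything other than a blah2 opening tag is printed unchanged, so only comments attached to a deleted block disappear.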