How to remove blocks of text from file

Question

EDIT: I did not mention before that this is to be executed on OS X.

I'm trying to create a bash script which will remove some blocks from a file and save the result to another one.

The file's content I want to filter should look like this:

<element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
</element>
<element>
    <subElement name="removeme"/>
    <subElement name="removeme"/>
    <subElement name="removeme"/>
</element>
<element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
</element>

What I want to remove is the whole group, including the <element></element> tags, that contains <subElement name="removeme"/> subelements.

It's guaranteed that no group will have "removeme" and "leaveme" elements mixed.

I know how to do this with a regular expression like this:

<element>(?:(?!/elem).)*"removeme".*?</element>

but I'm really lost on how to do it in a shell script. I found some info about sed but did not understand how to accomplish that.

Thanks.

`sed` is not so good for this task. Try `awk` instead. Have a look at Jotne's answer (or possibly mine) [here](http://stackoverflow.com/questions/24814783/extract-multiple-lines-on-either-end-of-pattern-which-are-enclosed-by-an-identif/24815006?noredirect=1#comment38527405_24815006). It's basically the opposite of what you want but you should be able to adapt it. — ooga, Jul 19 '14 at 01:25
I did look at it, but it just uses some separators to define the removed content. I need to know whether the content contains a certain text to decide whether to remove it. Is it possible to adapt it? — Gusman, Jul 19 '14 at 01:29
It uses both separators (like your `<element>` tags) and also content. I think it would be easy to adapt. I'll try it and let you know if it isn't applicable, but I think it is. — ooga, Jul 19 '14 at 01:35
Also note glenn jackman's answer, which is really more appropriate and definitely more bulletproof. — ooga, Jul 19 '14 at 01:37

score 3 · Answer 1 · edited May 23 '17 at 11:50

Regular expressions are certainly the wrong tool to parse XML. You want an XML processing tool to remove the nodes matching this XPath:

//element[subElement[@name="removeme"]]

That is: element nodes that have a subElement child whose name attribute has the value removeme.

Using xmlstarlet:

xmlstarlet ed -d '//element[subElement[@name="removeme"]]' << ENDXML
<elements>
   <element>
      <subElement name="leaveme"/>
      <subElement name="leaveme"/>
      <subElement name="leaveme"/>
   </element>
   <element>
      <subElement name="removeme"/>
      <subElement name="removeme"/>
      <subElement name="removeme"/>
   </element>
   <element>
      <subElement name="leaveme"/>
      <subElement name="leaveme"/>
      <subElement name="leaveme"/>
   </element>
</elements>
ENDXML

Output:

<?xml version="1.0"?>
<elements>
  <element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
  </element>
  <element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
  </element>
</elements>
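
Note that the sample in the question has several top-level <element> nodes and no single root, so an XML parser will reject it as it stands; the heredoc above sidesteps that by adding an <elements> wrapper. A minimal sketch of doing the same for a file, assuming xmlstarlet is installed and the file has no root element or XML declaration of its own (the function name remove_marked_elements is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: read rootless XML on stdin, wrap it in a temporary
# <elements> root so it parses, then delete the matching <element> nodes.
# Assumes xmlstarlet is installed and on PATH.
remove_marked_elements() {
  {
    printf '%s\n' '<elements>'
    cat
    printf '%s\n' '</elements>'
  } | xmlstarlet ed -d '//element[subElement[@name="removeme"]]'
}
```

It could be called as remove_marked_elements < input.xml > output.xml (both file names are placeholders). The result keeps the added <elements> wrapper and the XML declaration that xmlstarlet emits.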

I tried it, but xmlstarlet was not found in bash on OS X. Is there any replacement for it? — Gusman, Jul 19 '14 at 01:32
@Gusman You would need to install [xmlstarlet](http://xmlstar.sourceforge.net/overview.php). — ooga, Jul 19 '14 at 01:59
This is the right way to go, but it's not installed on most systems by default, and not all users have the right to install another tool. — Jotne, Jul 19 '14 at 07:13

score 1 · Accepted Answer · edited May 23 '17 at 12:21

The idea of the following (based on Jotne's post here) is to collect all lines of the file in the lines array. The line numbers of the <element> and </element> tags are saved in i_start and i_end, respectively. If <subElement name="removeme"/> is seen, found is set to 1 (true). At the closing tag, i_end is set to 0 if found is true, or to the line number (array index) of the closing tag otherwise, and found is reset for the next block. The lines from i_start to i_end are printed only if i_end is not zero.

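awk '
  { lines[NR] = $0 }                                   # remember every line by its number
  /<element>/   { i_start = NR }                       # a block starts here
  /<\/element>/ { i_end = found ? 0 : NR; found = 0 }  # 0 means: skip this block
  /<subElement name="removeme"\/>/ { found = 1 }       # mark the current block for removal
  i_end {                                              # closing tag of a block to keep
    for (i = i_start; i <= i_end; i++)
      print lines[i]
    i_end = 0
  }
' file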
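
One thing to be aware of: the script above prints only the lines between an opening and a closing tag, so anything outside an <element> block (a wrapper root, for instance) is dropped. The following is a sketch of a variant, not the original method: it buffers the current block in a string instead of the lines array and passes outside lines through unchanged. The function name drop_blocks and its pattern argument are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout, dropping every
# <element>...</element> block that has a line matching the regex in $1.
drop_blocks() {
  awk -v pat="$1" '
    /<element>/   { buf = ""; in_block = 1; found = 0 }      # start buffering a new block
    in_block      { buf = buf $0 "\n"; if ($0 ~ pat) found = 1 }
    !in_block     { print }                                  # lines outside any block pass through
    /<\/element>/ { if (!found) printf "%s", buf; in_block = 0 }
  '
}
```

To save the result to another file, as the question asks, it could be run as drop_blocks 'name="removeme"' < input.xml > output.xml (file names are placeholders).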

score 1 · Answer 3 · answered Jul 19 '14 at 07:05

Using GNU awk (gawk) you can do it like this:

awk -v RS="<element>" '!/removeme/ && NR>1{print RS $0}' file

Output:

<element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
</element>

<element>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
    <subElement name="leaveme"/>
</element>

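Setting RS (the record separator) to <element> makes awk read the file in blocks: each record is the text that follows one <element> tag. The separator itself is not part of the record, so RS is printed back in front of each block, and NR>1 skips the empty record before the first <element>.
The !/removeme/ condition tells awk not to print any block that contains removeme.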
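
The blank lines in the output appear because each record already ends with the newline after </element>, and print then adds one more. A small sketch of the same record-splitting idea that uses printf instead, so no extra newline is added. It assumes an awk that accepts a multi-character RS (gawk and mawk do); the function name keep_unmarked_records is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: split stdin on "<element>" and print back only the
# records that do not mention removeme. Assumes awk supports a
# multi-character RS (true for gawk and mawk).
keep_unmarked_records() {
  awk -v RS='<element>' '
    # NR > 1 skips the empty record before the first <element>;
    # printf re-adds the separator without appending another newline.
    NR > 1 && !/removeme/ { printf "%s%s", RS, $0 }
  '
}
```

It could be run as keep_unmarked_records < input.xml > output.xml (file names are placeholders).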

lledr · Answer 4 · 2014-07-19T10:59:13.863

Using sed:

sed -n '
    /<element>/h
    /<element>/!H
    /<\/element>/{g;/<subElement name="removeme"\/>/!p;}
' file

The -n option suppresses automatic printing, so only blocks printed explicitly with p reach the output.

On a match, the /<element>/h command overwrites the hold space with the pattern space contents, which starts a new block.

The /<element>/!H command appends the pattern space contents to the hold space if the line doesn't match <element>.

The /<\/element>/{g;/<subElement name="removeme"\/>/!p;} command tests for the closing tag and, on a match, executes the two subsequent commands:

g copies the populated hold space to the pattern space, so the next regular expression is tested against the whole element block.
/<subElement name="removeme"\/>/!p looks for the filtering subelement; if there is no match, the pattern space is printed.
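
As a sketch of how the commands above could be packaged for a script that writes to another file, here is the same sed program wrapped in a function, with the braced commands on separate lines (a form both GNU and BSD sed accept). The function name filter_elements is made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed program above: reads blocks on stdin
# and writes only the blocks without a removeme subelement to stdout.
filter_elements() {
  sed -n '
    /<element>/h
    /<element>/!H
    /<\/element>/{
      g
      /<subElement name="removeme"\/>/!p
    }
  '
}
```

It could be run as filter_elements < input.xml > output.xml (file names are placeholders).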
