Remove text between substrings (whether on the same line or across lines) only if it contains a pattern

Question

I have some data (XML) in a file and need to remove the text from Substring1 up to Substring2 (both included), but only if that text contains a pattern. I cannot delete whole lines, so sed's d command does not fit. The formatting varies: Substring1 and Substring2 can be on the same line or on different lines, and one line can hold several Substring1/Substring2 pairs.

Example: line 1 has two pairs, the first of which contains PATTERN; line 2 has one pair with PATTERN; line 3 has one pair without PATTERN; lines 4 and 5 together hold one pair with PATTERN; lines 6 and 7 together hold one pair without PATTERN.

Substring1 = <?xml

Substring2 = </update>

Pattern = PATTERN

tmp.log
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

Expected output:
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

I've tried (without full success) different combinations like the following:

sed -i "s#<?xml.*PATTERN.*</update>##g" tmp.log

sed -i "#<?xml#{p; :a; N; #</update>#!ba; s#.*\n##}; p" tmp.log

perl -pi -e 's/<?xml.*PATTERN.*update>//' tmp.log

As far as I can see, these remove whole lines and miss the case where the substrings are on different lines. I also do not really check for PATTERN here. Any help appreciated.

1480 items match `[sed] xml` when you search here. Did you look at any of them? Good luck. — shellter, Jul 14 '16 at 12:06
Thanks for the comments. Yes, I've tried at least ten different ones. The thing is that the formatting is basically XML, but newlines can be anywhere. @ssr1012 I am adding my expected output as an update to the post — nrp, Jul 14 '16 at 14:05
Obligatory link to essay on parsing *ML with regexp: http://stackoverflow.com/a/1732454/936986 — Oleg V. Volkov, Jul 14 '16 at 14:44
@ssr1012 Actually, it is pretty the same all time: is just one of the tags inside a string, the final one is always "". And no tags like "". — nrp, Jul 17 '16 at 06:56

score 2 · Accepted Answer · answered Jul 14 '16 at 14:29


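With gawk (a regular expression as RS is a gawk extension):

awk -v RS='<\\?xml' 'NR!=1 && !/PATTERN/ {printf "<?xml%s", $0}' tmp.log

Each record is the text that follows one <?xml, so line breaks inside a block do not matter. Records that contain PATTERN are skipped, and printf puts back the <?xml that the record separator consumed without adding extra spaces or newlines.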

jijinp
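
The record-splitting idea can also be sketched outside awk. The following is a minimal Python illustration, not part of the original answer; the function name drop_records is made up here. Like the gawk command, it assumes that every block starts with <?xml and that anything before the first <?xml can be discarded:

```python
def drop_records(text, start="<?xml", pattern="PATTERN"):
    """Split text at every start marker and keep the records without pattern."""
    # The first piece is whatever precedes the first marker (awk's NR == 1);
    # it is discarded, as in the gawk command.
    _, *records = text.split(start)
    # Re-attach the marker that split() consumed before joining the survivors.
    return "".join(start + rec for rec in records if pattern not in rec)


if __name__ == "__main__":
    demo = (
        "<?xml a PATTERN\n"
        "b </update><?xml c </update>\n"
        "<?xml d\n"
        "PATTERN </update>\n"
    )
    print(drop_records(demo), end="")
```

A record here runs up to the next <?xml, so any text that sits between </update> and the following <?xml is kept or dropped together with the record in front of it.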


zdim · Answer 2 · 2016-07-15T09:26:56.530

If there is any more of this to do, please use one of the good XML modules. Both XML::LibXML and XML::Twig are excellent. That said, here is direct parsing.

use warnings;
use strict;

# Sample text for testing
my $text = q(start <?xml with PATTERN yes </update> and <?xml good </update> end); 

my $beg  = qr(<\?xml);
my $end  = qr(</update>);
my $patt = qr(PATTERN);

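# (?:(?!$end).)*? never steps over $end, so a match cannot start in one block and end in a later one
$text =~ s|$beg(?:(?!$end).)*?$patt.*?$end||gs;

print "$text\n";

The .*? is non-greedy, and the /s modifier makes . match newlines, so a block may span several lines. Non-greediness alone is not enough, though: starting from the <?xml of a block without PATTERN, a plain .*? would run past that block's </update> to reach PATTERN in a later block and delete both. The (?:(?!$end).)*? guard stops the scan at the first </update>. Since the text in the question is unclear to me I've used the $text above as input: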

start <?xml with PATTERN yes </update> and <?xml good </update> end

With this input in $text, the above code prints

start  and <?xml good </update> end
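
A lazy .*? by itself can still reach from a block without PATTERN into a later block that has it. The sketch below, in Python rather than Perl, contrasts the two forms; the names NAIVE, GUARDED, and remove_blocks, and the sample text, are made up for illustration and do not come from the answer:

```python
import re

START, END, PATTERN = "<?xml", "</update>", "PATTERN"

# Lazy only: .*? may still cross an END to find PATTERN in a later block.
NAIVE = re.compile(
    re.escape(START) + r".*?" + re.escape(PATTERN) + r".*?" + re.escape(END),
    re.DOTALL,
)

# Guarded: (?:(?!END).)*? advances one character at a time, never across END.
GUARDED = re.compile(
    re.escape(START)
    + r"(?:(?!" + re.escape(END) + r").)*?"
    + re.escape(PATTERN)
    + r".*?"
    + re.escape(END),
    re.DOTALL,  # let . match newlines so a block may span lines
)


def remove_blocks(text):
    """Delete every START..END block that contains PATTERN."""
    return GUARDED.sub("", text)


sample = (
    "<?xml keep-1 </update><?xml PATTERN drop-1 </update>\n"
    "<?xml PATTERN drop-2\n"
    "still drop-2 </update>\n"
    "<?xml keep-2\n"
    "still keep-2 </update>\n"
)

if __name__ == "__main__":
    print("naive:", repr(NAIVE.sub("", sample)))
    print("guarded:", repr(remove_blocks(sample)))
```

With this sample the naive form also deletes the keep-1 block, while the guarded form removes only the two PATTERN blocks. The newlines that followed removed blocks stay behind as empty lines and would need a separate cleanup step if they matter.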

score 0 · Answer 3 · answered Jul 14 '16 at 15:55

Please try this one:

use strict;
use warnings;

my $newDATA = "";
while(<DATA>)
{
    my $each_line = $_;
    my ($pre, $match, $post) = ("", "", "");
    # Take one <?xml ... </update> pair at a time; a pair may not cross a newline or another <?xml
    while($each_line =~ /<\?xml((?:(?!<\?xml|\n).)*)<\/update>/sg)
    {
        $pre .= $`; $match = $&; $post = $';
        # Drop the pair if it contains PATTERN
        if($match =~ m/PATTERN/i)
        {  $match = "";  }
        $pre .= $match; $each_line = $post;
    }
    if(length $pre) {  $each_line = $pre.$post;  }
    $newDATA .= $each_line;
}
# Collapse the empty lines left behind by removed pairs
$newDATA =~ s/\n{2,}/\n/g;
print $newDATA;

INPUT:

__DATA__
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

OUTPUT:

<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

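Your XML tagging is very inconsistent. Could you please check it and try the above Perl code? Note that the code works line by line, so for the pair split across lines 4 and 5 only the part on line 4 is removed, and line 5 stays in the output.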
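
The leftover line 5 in the output above comes from processing one line at a time. A possible way around it is to scan the whole input at once. The Python sketch below is only an illustration, with a made-up function name strip_blocks, and it rests on an assumption that the thread does not confirm: a block runs from <?xml to the last </update> that appears before the next <?xml (or before the end of the input).

```python
def strip_blocks(text, start="<?xml", end="</update>", pattern="PATTERN"):
    """Remove start..end blocks containing pattern, scanning the whole text."""
    out = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        # Assumption: the block may not extend past the next start marker.
        nxt = text.find(start, begin + len(start))
        if nxt == -1:
            nxt = len(text)
        # The block ends at the last end marker before that limit.
        stop = text.rfind(end, begin, nxt)
        if stop == -1:
            # Unterminated block: copy it through unchanged.
            out.append(text[pos:nxt])
            pos = nxt
            continue
        stop += len(end)
        out.append(text[pos:begin])   # text between blocks is always kept
        block = text[begin:stop]
        if pattern not in block:
            out.append(block)
        pos = stop
    out.append(text[pos:])            # trailing text after the last block
    return "".join(out)


if __name__ == "__main__":
    demo = (
        "<?xml PATTERN-line4 <update>2016-03-24</update>\n"
        "<upd_time>t</upd_time> line5 </update>\n"
        "<?xml line6 <update>2016-03-24</update>\n"
        "<upd_time>t</upd_time> line7 </update>\n"
    )
    print(strip_blocks(demo), end="")
```

Because the newline after a removed block is ordinary text between blocks, it is kept; empty lines can be collapsed afterwards if the output should not contain them.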
