Remove a static block of text from file containing variable data

Question

I have a static block of text that I need to remove from a file that is created nightly by concatenating multiple files into one. The text spans 7 lines as one block and contains special characters such as ", >, and /. I know I should be able to use awk, sed, or perl, but I can't get the escaping of the special characters right: the command either errors out or does not find the block.

The block is always the following, on separate lines:

</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>

I want to change

</item>
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<item>

into

</item>
<item>

It appears 8 times in the file, which is created by concatenating multiple streams.

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). — Cyrus, Aug 09 '21 at 17:14
Please add sample input (valid XML, no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). — Cyrus, Aug 09 '21 at 17:14
Is there something special/particular about this block, or do you want to remove **ALL** blocks like `</channel> ... </link>`? — markp-fuso, Aug 09 '21 at 17:49
This particular block, which appears multiple times, nothing else matches the — eli blaustein, Aug 09 '21 at 18:06

score 0 · Answer 1 · answered Aug 09 '21 at 18:01

With GNU awk for multi-char RS:

$ awk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0; next} s=index($0,rmv){$0=substr($0,1,s-1) substr($0,s+length(rmv))} 1' remove file
</item>
<item>

The above will work for any characters in the files since it just does a literal string comparison. It was run on these input files:

$ head remove file
==> remove <==
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>

==> file <==
</item>
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<item>

Ed Morton

appears to only remove the 1st instance – markp-fuso Aug 09 '21 at 18:19
@markp-fuso Right, that's all there is in the sample input. If the OP wants more than 1 instance removed they should include at least 2 in the sample input/output. – Ed Morton Aug 09 '21 at 18:22
The text between instances is like 50,000 characters – eli blaustein Aug 09 '21 at 18:32
@eliblaustein if you're saying you can't show 2 instances in your example because in your real data there's 50,000 characters between them - no, we don't need to see all of that, just create a [mcve] with a couple of lines between each block and show **that** in your question. – Ed Morton Aug 09 '21 at 19:18
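
As the comments above note, the awk command in this answer removes only the first occurrence of the block. The following is a minimal sketch of a variant that removes every occurrence, not the answerer's own code. It assumes the block is stored in a non-empty file and that the data file fits in memory, and it uses only POSIX awk features instead of GNU awk's RS='^$'. The function name remove_block is hypothetical.

```shell
# Hypothetical helper: remove_block BLOCK_FILE DATA_FILE
# Prints DATA_FILE with every literal occurrence of BLOCK_FILE removed.
# Assumption: BLOCK_FILE is not empty and both files end with a newline.
remove_block() {
    awk '
        # First file: rebuild the block, one line at a time.
        NR == FNR { rmv = rmv $0 "\n"; next }
        # Second file: rebuild the data the same way.
        { text = text $0 "\n" }
        END {
            # index() does a literal search, so no character needs escaping.
            while ((s = index(text, rmv)) > 0)
                text = substr(text, 1, s - 1) substr(text, s + length(rmv))
            printf "%s", text
        }
    ' "$1" "$2"
}
```

With the input files shown in this answer, it could be called as `remove_block remove file > file.new`. Because the comparison is literal, a trailing `</channel>` and `</rss>` at the end of the data is left alone unless the complete block follows it.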

score 0 · Accepted Answer · answered Aug 09 '21 at 18:30

Assumptions:

all blocks like </channel> ... </link> are to be removed
OP has stated the file has 8x of these blocks
actual data is formatted as in OP's sample inputs (otherwise, as Cyrus has mentioned, an XML/HTML parser may be more appropriate)

Sample data:

$ cat sample.dat
</item> keep this line
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<another_item> keep this line

<link><![CDATA[https://www.example.com/KEEP_THIS_LINE]]></link>

</another_item> keep this line
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
</one_more_item> keep this line

One sed idea to find ranges of rows with the bookends </channel> and </link> and then delete said ranges:

$ sed '/<\/channel>/,/<\/link>/d' sample.dat
</item> keep this line
<another_item> keep this line

<link><![CDATA[https://www.example.com/KEEP_THIS_LINE]]></link>

</another_item> keep this line
</one_more_item> keep this line

Once OP has verified the accuracy of the answer the -i flag can be added if the intention is to overwrite the input file with the results.
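
A small sketch of the in-place variant is below. Note that -i is not part of POSIX sed; this assumes a sed that accepts -i with an attached backup suffix, which GNU and BSD sed both do. The function name delete_blocks_in_place and the .bak suffix are illustrative choices, not part of the original answer.

```shell
# Hypothetical helper: delete_blocks_in_place FILE
# Deletes every </channel> ... </link> range from FILE in place and
# keeps the original contents in FILE.bak.
# Assumption: sed supports -i with an attached suffix (GNU and BSD sed do).
delete_blocks_in_place() {
    sed -i.bak '/<\/channel>/,/<\/link>/d' "$1"
}
```

Keeping the backup makes it easy to compare the result against the original (for example with diff) before discarding it.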

Almost: it removes the part of the block at the end of the file. That is not a big deal for me as I can re-add it, but it would be an issue for someone looking at this as a general solution. — eli blaustein, Aug 09 '21 at 18:48
I'm not sure how `part of the block` can be removed ... could you update the question with an example of what you're talking about re: `part of the block at the end of the file`? — markp-fuso, Aug 09 '21 at 18:53
The file ends with this, which gets removed by your sed command even though the rest of the text is not present. I just re-add those tags to the end of the file, which works for me. Thank you for the help. — eli blaustein, Aug 09 '21 at 19:03
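
The behavior described in these comments follows from how sed address ranges work: once the start pattern matches, the range runs until the end pattern is found, and if it is never found the range extends to the end of the input. The sketch below is one possible workaround, not part of the accepted answer. It buffers a candidate block in awk and discards it only when the closing </link> line is seen, so an unterminated block at the end of the file is printed unchanged. The function name remove_ranges is hypothetical.

```shell
# Hypothetical helper: remove_ranges FILE
# Prints FILE without the </channel> ... </link> ranges, but keeps a
# trailing range that never reaches a </link> line.
remove_ranges() {
    awk '
        # Inside a candidate block: keep buffering until </link> shows up.
        inblk {
            buf = buf $0 "\n"
            if (/<\/link>/) { inblk = 0; buf = "" }
            next
        }
        # Start of a candidate block; the end is only looked for on later lines.
        /<\/channel>/ { inblk = 1; buf = $0 "\n"; next }
        { print }
        # Unterminated block at end of input: emit it untouched.
        END { if (inblk) printf "%s", buf }
    ' "$1"
}
```

For example, `remove_ranges sample.dat` should give the same output as the sed command for the sample data above, while a file that ends with a bare `</channel>` and `</rss>` keeps those two lines.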
