I have a static block of text that I need to remove from a file that is created nightly by concatenating multiple files into one. The block spans 7 lines and contains a number of special characters such as " , > , and / . I know I should be able to use awk, sed, or perl, but I can't get the escaping of the special characters right: the command either errors out or does not find the block.

The block is always exactly this, on separate lines:

</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>

I want to change

</item>
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<item>

into

</item>
<item>

The block appears 8 times in the file, which is created by concatenating multiple streams.

markp-fuso

2 Answers

With GNU awk for multi-char RS:

$ awk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0; next} s=index($0,rmv){$0=substr($0,1,s-1) substr($0,s+length(rmv))} 1' remove file
</item>
<item>

The above will work for any characters in the files, since it does a literal string comparison rather than a regexp match. It was run on these input files:

$ head remove file
==> remove <==
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>

==> file <==
</item>
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<item>
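
The command above removes only the first occurrence of the block. If every occurrence should go, the same literal index() comparison can be repeated in a loop. The following is a minimal sketch, not part of the original answer: the function name remove_block is hypothetical, the block file is assumed to be non-empty, and both files are assumed to end with a newline. It avoids the GNU-specific RS='^$' by collecting the lines itself, so it should also run under other POSIX awks.

```shell
#!/bin/sh
# remove_block BLOCK_FILE TARGET_FILE   (hypothetical helper name)
# Prints TARGET_FILE with every literal occurrence of the contents
# of BLOCK_FILE removed. Assumes BLOCK_FILE is not empty.
remove_block() {
    awk '
        # First file: collect the block to remove, line by line.
        NR == FNR { rmv = rmv $0 "\n"; next }

        # Second file: collect the text to be cleaned.
        { text = text $0 "\n" }

        END {
            # index() compares strings literally, so characters such as
            # the double quote, > and / need no escaping.
            while ((s = index(text, rmv)) > 0)
                text = substr(text, 1, s - 1) substr(text, s + length(rmv))
            printf "%s", text
        }
    ' "$1" "$2"
}
```

With the two input files shown above, it would be called as remove_block remove file. Because the whole target file is held in memory, this is only a reasonable choice for files that fit comfortably in RAM.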
Ed Morton
  • appears to only remove the 1st instance – markp-fuso Aug 09 '21 at 18:19
  • @markp-fuso Right, that's all there is in the sample input. If the OP wants more than 1 instance removed they should include at least 2 in the sample input/output. – Ed Morton Aug 09 '21 at 18:22
  • The text between instances is like 50,000 characters – eli blaustein Aug 09 '21 at 18:32
  • @eliblaustein if you're saying you can't show 2 instances in your example because in your real data there's 50,000 characters between them - no, we don't need to see all of that, just create a [mcve] with a couple of lines between each block and show **that** in your question. – Ed Morton Aug 09 '21 at 19:18

Assumptions:

  • all blocks like </channel> ... </link> are to be removed
  • OP has stated the file has 8x of these blocks
  • actual data is formatted as in OP's sample inputs (otherwise, as Cyrus has mentioned, an XML/HTML parser may be more appropriate)

Sample data:

$ cat sample.dat
</item> keep this line
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
<another_item> keep this line

<link><![CDATA[https://www.example.com/KEEP_THIS_LINE]]></link>

</another_item> keep this line
</channel>
</rss><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title><![CDATA[Example]]></title>
<description><![CDATA[Example]]></description>
<link><![CDATA[https://www.example.com/]]></link>
</one_more_item> keep this line

One sed idea: find each range of lines that starts with </channel> and ends with </link>, then delete those ranges:

$ sed '/<\/channel>/,/<\/link>/d' sample.dat
</item> keep this line
<another_item> keep this line

<link><![CDATA[https://www.example.com/KEEP_THIS_LINE]]></link>

</another_item> keep this line
</one_more_item> keep this line

Once OP has verified the accuracy of the answer, the -i flag can be added if the intention is to overwrite the input file with the results.
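
One caveat with a sed address range: if a line matches the start pattern but no later line matches the end pattern, the range runs to the end of the input, so a lone trailing </channel> would be deleted as well. The following is a minimal sketch of a buffered alternative in awk, under the same formatting assumptions as above. The function name drop_closed_ranges is hypothetical. Lines are held back from </channel> onward and discarded only once a closing </link> line is seen; a range that never closes is printed unchanged.

```shell
#!/bin/sh
# drop_closed_ranges FILE   (hypothetical helper name)
# Deletes each range of lines from one matching </channel> through the
# next later line matching </link>. A range still open at end of file
# is kept rather than deleted.
drop_closed_ranges() {
    awk '
        # Inside a candidate range: buffer the line instead of printing it.
        inblock {
            buf = buf $0 "\n"
            if ($0 ~ /<\/link>/) {   # range closed: discard the buffer
                buf = ""
                inblock = 0
            }
            next
        }

        # Start of a candidate range: begin buffering.
        /<\/channel>/ { inblock = 1; buf = $0 "\n"; next }

        # Any other line passes straight through.
        { print }

        # Unterminated range: emit the buffered lines untouched.
        END { printf "%s", buf }
    ' "$1"
}
```

It would be called as drop_closed_ranges sample.dat, with the output redirected to a new file once the result has been checked.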

markp-fuso
  • Almost; it also removes the part of the block at the end of the file. Not a big deal for me, as I can re-add it, but that would be an issue for someone looking at this for a general solution. – eli blaustein Aug 09 '21 at 18:48
  • I'm not sure how `part of the block` can be removed ... could you update the question with an example of what you're talking about re: `part of the block at the end of the file`? – markp-fuso Aug 09 '21 at 18:53
  • The file ends with this, which gets removed by your sed command even though the rest of the text is not present. I just re-add those tags to the end of the file, which works for me. Thank you for the help – eli blaustein Aug 09 '21 at 19:03