I have a large flat file containing many instances of a repeated string I would like to remove:

<content type="html">
  &lt;p&gt; &lt;/p&gt;
  &lt;p&gt;Jump around on couch, meow constantly until given food.&lt;/p&gt;
  &lt;p&gt; &lt;/p&gt;
</summary>

Because you can't parse [X]HTML with regex, I'm looking for a solution that doesn't require me to write my own regex. I tried using tr without any luck. Here's my desired output:

<content type="html">

  &lt;p&gt;Jump around on couch, meow constantly until given food.&lt;/p&gt;

</summary>

How can I remove the repeating string from bash without writing regex?

vhs
    since it is xml, look into https://stackoverflow.com/tags/xmlstarlet/info. I haven't used it personally, so I don't know how it can be used for this case... – Sundeep Jul 12 '17 at 12:56

2 Answers

I used a tool called rpl which didn't require me to write any regex:

$ rpl '&lt;p&gt; &lt;/p&gt;' '' /tmp/file

Really DELETE all occurences of &lt;p&gt; &lt;/p&gt; (case sensitive)? (Y/[N]) Y
Replacing "&lt;p&gt; &lt;/p&gt;" with "" (case sensitive) (partial words matched)
A Total of 55 matches replaced in 1 file searched.

Installed via Homebrew with brew install rpl. Finished in 2 minutes.
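
If installing a separate tool is not an option, a similar literal (non-regex) replacement can be sketched with bash's own parameter expansion. This is only a sketch and not how rpl works internally: the function name strip_literal is made up for illustration, and it relies on bash treating a quoted pattern inside ${var//pattern/} as a literal string rather than a glob.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete every literal occurrence of $1 from stdin.
strip_literal() {
  local needle=$1 line
  # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # Quoting "$needle" makes bash match it literally, not as a glob pattern.
    printf '%s\n' "${line//"$needle"/}"
  done
}

# Usage (writes to a new file instead of editing in place):
#   strip_literal '&lt;p&gt; &lt;/p&gt;' < /tmp/file > /tmp/file.cleaned
```

A read loop like this is likely to be slow on a very large file, so it is probably best treated as a fallback rather than a replacement for a dedicated tool.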

vhs

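With some knowledge of regular expressions, it would be the following (the -i.bck option edits the file in place and keeps a backup with the .bck suffix):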

sed -i.bck 's~&lt;p&gt; &lt;/p&gt;~~g' /tmp/file
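
Because -i.bck leaves a copy of the original next to the edited file, the result can be sanity-checked without any regex by counting literal matches with grep -F. A rough sketch, assuming the backup ended up at /tmp/file.bck and that the local grep supports -o (GNU and BSD grep do); count_literal is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count literal (fixed-string) occurrences of $1 in file $2.
count_literal() {
  # -F: treat the pattern as a fixed string; -o: print each match on its own line.
  # "--" guards against patterns that start with a dash.
  # tr strips the padding some wc implementations add around the number.
  grep -F -o -- "$1" "$2" | wc -l | tr -d ' '
}

# Usage:
#   count_literal '&lt;p&gt; &lt;/p&gt;' /tmp/file.bck   # occurrences before the edit
#   count_literal '&lt;p&gt; &lt;/p&gt;' /tmp/file       # should print 0 afterwards
```
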
hek2mgl
  • Thanks for providing a solution. I've updated the question to try and make it more clear what I'm trying to achieve and why RegExp may not be the best approach for my needs. – vhs Jul 12 '17 at 13:13