
File:

<img src="url1" parameters1 width="22"> some text
<img src="url2" parameters2 width="44"> another text

I want it to become:

 some text
 another text

I need a sed/awk command (or a similar Linux utility) that does the following: find the first occurrence of <img, then search forward for the next occurrence of ">, and delete everything in between, including the <img and "> themselves. It should then repeat this until the end of the file, so that all such blocks of text are removed.

Note that the text between the two markers may contain arbitrary characters, such as {};'"$%#=

I will then use that command on each file in a directory and its subdirectories.
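
One way the transformation described above could be sketched with sed and find. This is only a sketch under assumptions: each <img ... "> block sits on a single line and contains no other > character, the function names `strip_img` and `strip_img_tree` are hypothetical, and `-i` is the GNU sed in-place flag (BSD sed expects `-i ''`).

```shell
#!/bin/sh
# Sketch: delete every span that starts at '<img' and ends at the next '">'.
# Assumption: the span is on one line and contains no '>' before its closing '">'.
strip_img() {
    # Reads the named files (or stdin) and writes the cleaned text to stdout.
    sed 's/<img[^>]*">//g' "$@"
}

# Hypothetical helper: apply the same substitution in place to every regular
# file under the directory given as $1, including its subdirectories.
# Assumption: GNU sed, whose -i option edits each file in place.
strip_img_tree() {
    find "$1" -type f -exec sed -i 's/<img[^>]*">//g' {} +
}

# Example usage (not executed here):
#   strip_img page.html > page.clean.html
#   strip_img_tree ./site
```

Because the edit is regex based, it can misfire on markup where "> appears inside an attribute value or where a tag spans several lines, so it is worth running `strip_img` on a copy first and comparing the output before using the in-place variant.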

  • Use an XML parser instead of text based unix tools – anubhava May 26 '20 at 14:46
  • Some call it [summoning the daemon](https://www.metafilter.com/86689/), others refer to it as [the Call for Cthulhu](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and few just [turned mad and met the Pony](https://stackoverflow.com/a/1732454/8344060). In short, never parse XML or HTML with a regex! Did you try an XML parser such as `xmlstarlet`, `xmllint` or `xsltproc`? – kvantour May 26 '20 at 16:06
  • Haven't tested but xml parsers might fail since this is not proper xml. `img` doesn't have closing tag. – karakfa May 26 '20 at 16:18
  • @karakfa this seems to be an XY problem. The person clearly already was able to extract parts of an XML/HTML file. – kvantour May 26 '20 at 16:51
  • Post valid HTML/XHTML/XML. – Cyrus May 26 '20 at 17:30
  • This: `sed 's/<img[^>]*\">//'` will work *most of the time.* Use at your own risk. – Beta May 26 '20 at 18:15

0 Answers