
I am using sed to remove all HTML tags from a file.

Text:

<html>
   <body>
      <h1>Hello World!</h1>
   </body>
</html>

I have checked that the basic regular expressions `<.*\?>` and `<[^>]*>` match only the HTML tags in the text.

When I use `sed 's/<.*\?>//g' [input-file]`, sed deletes everything and prints five blank lines, whereas `sed 's/<[^>]*>//g' [input-file]` produces the correct output: two blank lines, then Hello World! with its original indentation, then two more blank lines.

Why does it behave differently for similar matches?
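
For reference, here is a minimal sketch that reproduces the difference on the sample text. It assumes a POSIX shell with `sed` and `mktemp` available; the temporary file and the variable names `input`, `greedy`, and `bounded` are only for illustration. It uses the plain greedy pattern `<.*>` because, as the comments below point out, sed has no non-greedy quantifiers: in GNU sed, `\?` means "zero or one", not "match lazily", so `<.*\?>` behaves like `<.*>` on this input.

```shell
#!/bin/sh
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
<html>
   <body>
      <h1>Hello World!</h1>
   </body>
</html>
EOF

# Greedy: `.*` extends to the LAST `>` on the line, so on the third line
# the match spans `<h1>Hello World!</h1>` and the text is deleted too.
greedy=$(sed 's/<.*>//g' "$input")

# Negated class: `[^>]*` cannot cross a `>`, so each match ends at the
# first closing bracket and only the tags themselves are removed.
bounded=$(sed 's/<[^>]*>//g' "$input")

printf 'greedy:\n%s\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
rm -f "$input"
```

Note that `$(...)` strips trailing newlines, so the trailing blank lines seen when running sed directly are not shown by the `printf` calls.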

HarshvardhanSharma
  • 3
    Do not use `sed` for HTML text parsing, use syntax aware parsers – Inian Dec 28 '17 at 08:27
  • 1
    sed doesn't support non-greedy quantifiers; see https://www.gnu.org/software/sed/manual/sed.html#Regular-Expressions-Overview and https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y – Sundeep Dec 28 '17 at 09:05

0 Answers