
I'm looking to match all less-than ('<') or greater-than ('>') signs in a file using sed. I only want to match the single character.

My goal is to replace them with ' <' and '> ' respectively (ensuring they have whitespace around them so I can parse them more easily).

For example, it would match: (without space within the tags)

< p >Hey this is a paragraph.< /p >< p >And here is another.< /p >

.. and turn it into (note the spaces)

 < p > Hey this is a paragraph. < /p >  < p > And here is another. < /p > 



Here's what my initial (wrong) guess was:

sed 's/<{1}|>{1}/ <> /' ...


It matches the whole word/line, which is not desired, and it also does not replace correctly.

Anyways, any help would be appreciated! Thanks!

jiman
  • You _really_ don't want to parse HTML with regular expressions. Use an HTML parser. (see http://stackoverflow.com/a/1732454/ which is one of the most-upvoted answers on SO for good reason) – Wooble Dec 21 '11 at 15:11
  • @Wooble: while I generally agree with your assertion, using regexes can still be okay for testing and the like ... if it goes beyond that, use a proper parser, though. – 0xC0000022L Dec 21 '11 at 15:27
  • Haha yeah, I know. I've seen that one. I'm writing a toy academic HTML formatter in perl for a very small subset of tags. I am just using sed and regex to ensure it has the whitespace that my perl code needs. – jiman Dec 21 '11 at 15:30

1 Answer


Try two substitutions to make it easier:

sed 's/</ </g ; s/>/> /g' file
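
As a quick check, here is a minimal sketch that runs the same two substitutions on the sample from the question, assuming the tags are written without inner spaces. The variable names `input` and `result` are only for illustration. The g flag makes each substitution apply to every match on the line rather than just the first. As far as I can tell, the original attempt fails because sed uses basic regular expressions by default, where an unescaped `{1}` and `|` are literal characters, and because the replacement text ' <> ' is inserted as-is.

```shell
# Sample line from the question (assumed here to have no spaces inside the tags).
input='<p>Hey this is a paragraph.</p><p>And here is another.</p>'

# First put a space before every '<', then a space after every '>'.
# The g flag applies each substitution to all matches on the line.
result=$(printf '%s\n' "$input" | sed 's/</ </g ; s/>/> /g')

# Show the transformed line.
printf '%s\n' "$result"
```

Adjacent tags end up with two spaces between them (one from each substitution), which matches the desired output shown in the question.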
sidyll