Before you go any further:
Quoting from "RegEx match open tags except XHTML self-contained tags":
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. [...] Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
Here's a step-by-step solution to your issue:
- Use an XML parser if you only have the full HTML.
- Use htmlspecialchars() or htmlentities() on the content.
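As a rough illustration only, the two steps might look something like the sketch below. It assumes PHP's bundled DOM extension, and it uses DOMDocument::loadHTML() (an HTML parser, which tolerates real-world markup better than a strict XML parser). The function name escapeInnerHtml, the sample markup, and the choice of the p tag are all made up for the example, since I don't know which element you actually need.

```php
<?php
// Hypothetical helper: returns the escaped inner markup of the first
// <$tagName> element found in $html, or null if there is none.
function escapeInnerHtml(string $html, string $tagName): ?string
{
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: let the parser locate the element, no regex involved.
    $element = $doc->getElementsByTagName($tagName)->item(0);
    if ($element === null) {
        return null;
    }

    // Serialize the element's children back into markup.
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    // Step 2: escape that markup so it displays as literal text.
    return htmlspecialchars($inner, ENT_QUOTES, 'UTF-8');
}

// Sample input, invented for this example.
echo escapeInnerHtml('<div><p>Hello <b>world</b></p></div>', 'p'), "\n";
```

htmlentities() takes the same arguments and could be swapped in for the last step if you want every character that has a named entity converted, not just the special ones.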
I won't explain how to do this, since there are already loads of articles on Google about this subject.
And, please, STOP using regular expressions to handle HTML!