This ought to work, at least on fairly tidy HTML:
</?\s*(?!(i|b|span)\b)\w+[^>]*>
A blow-by-blow explanation (courtesy of http://rick.measham.id.au/paste/explain.pl):
NODE            EXPLANATION
<               literal '<'
/?              '/' (optional)
\s*             any whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
(?!             look ahead to see if there is not:
(               start of OR'ed group
i               'i'
|               OR
b               'b'
|               OR
span            'span'
)               end of the OR'ed group
\b              the boundary between a word char (\w) and something that is not a word char
)               end of look-ahead
\w+             word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
[^>]*           any character except: '>' (0 or more times (matching the most amount possible))
>               literal '>'
Now what does this do in English?
It
- looks for the start of any tag: <
- matches an optional slash / because you want to find both opening and closing tags (<body> and </body>)
- skips any amount of whitespace (which is allowed here and, come to think of it, in several other places, so add more \s* to taste if necessary)
- the start of the negative lookahead. This is what Wiktor Stribiżew referred to, and it is explained in depth in "Regular expression to match a line that doesn't contain a word?".
- the OR'ed list of names that must not match appears inside the lookahead. I added parentheses to group them because ...
- there are other tags that start with b and i! The parentheses, followed by the \b, make sure it matches 'whole words' in the OR list only.
- the following \w+ matches the tag name itself (which, may I remind you, may not be i, b, or span per the negative lookahead).
- But HTML tags do not end there! (At least, opening tags don't.) After the tag name itself, just about any number of attributes may appear. There is a rule, observed casually by most HTML editors and software, that the character > may not appear inside such an attribute; it should be encoded as &gt;. So to match anything up to the very end of this tag, skip anything that is not >.
- ... closed by a final >, to match the end.
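The steps above can be tried out with a minimal Python sketch. The pattern is the one from this answer, spelled out in verbose mode so each part carries its own comment; the names TAG_EXCEPT_I_B_SPAN and strip_other_tags, and the sample string, are made up for illustration and are not part of the question.

```python
import re

# Same regex as above, in verbose mode (whitespace and # comments are ignored).
TAG_EXCEPT_I_B_SPAN = re.compile(
    r"""
    <                   # start of any tag
    /?                  # optional slash, so closing tags match too
    \s*                 # optional whitespace
    (?!(i|b|span)\b)    # the name must not be the whole word i, b or span
    \w+                 # the tag name
    [^>]*               # attributes: anything that is not a closing bracket
    >                   # end of the tag
    """,
    re.VERBOSE,
)

def strip_other_tags(html: str) -> str:
    """Remove every tag except i, b and span (hypothetical helper)."""
    return TAG_EXCEPT_I_B_SPAN.sub("", html)

sample = ('<p class="x">Some <b>bold</b>, <i>italic</i> and '
          '<span id="s">spanned</span> text<br></p>')
print(strip_other_tags(sample))
# Some <b>bold</b>, <i>italic</i> and <span id="s">spanned</span> text
```

Note that <br> is removed even though it starts with b: the \b after the group only lets the lookahead reject whole names.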
Why the warning for 'fairly tidy HTML' at the top? Because even though HTML is described in excruciating detail, neither software nor (alas) humans who manually enter HTML observe all those pesky rules. A few possible problems that can occur with this regex:
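- Self-closing tags. <br /> itself is caught, because [^>]* also consumes the space and the slash. A slash directly after an allowed name, as in <b/>, still counts as a word boundary, so that tag is left alone.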
- Unescaped > in attribute values. <img title="a > b"> will make it choke: the <img part and the first half of the title will be removed, but the second part and the final > character will remain.
- Random capitalization. HTML is indifferent to capitalization in tags, and you can open with <B> and close with </b>, but regexes are usually case sensitive by default. Your regex flavor may have an Ignore Case flag; if not, you need to add the capitalized characters as well.
- Blatantly malformed HTML. (There is no cure for that.)
- Probably countless others.
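Two of these failure modes are easy to reproduce. The following is a small Python sketch, assuming Python's re module as the regex flavor; the variable names are illustrative only.

```python
import re

PATTERN = r"</?\s*(?!(i|b|span)\b)\w+[^>]*>"

# Unescaped '>' in an attribute value: the match ends at the first '>',
# so the tail of the tag is left behind in the output.
leftover = re.sub(PATTERN, "", '<img title="a > b">')
print(repr(leftover))        # ' b">'

# Case sensitivity: <B> is not recognised as an allowed tag and is removed,
# while the matching </b> stays.
case_sensitive = re.sub(PATTERN, "", "<B>bold</b>")
print(case_sensitive)        # bold</b>

# With the ignore-case flag, both halves of the pair are kept.
case_insensitive = re.sub(PATTERN, "", "<B>bold</b>", flags=re.IGNORECASE)
print(case_insensitive)      # <B>bold</b>
```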
The best remedy is to ensure the HTML that goes "in" is already as clean as possible. You can use common tools such as HTMLTidy to preprocess your file. Better yet: do not attempt to make "RegEx match open tags except XHTML self-contained tags". (Paste the quoted text into any search engine for some fun.) A far superior solution is to use an HTML parser and simply kick out the tags you don't like. If your HTML is actually (properly formed) XHTML, this can also be done with XSLT, the general XML transformation language.
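As a rough illustration of the parser route, here is a sketch built on Python's standard html.parser module. The choice of parser is an assumption, since no particular one is named above, and ALLOWED, TagFilter and keep_only_allowed are hypothetical names. It keeps the text and the i, b and span tags, and drops every other tag.

```python
from html.parser import HTMLParser

ALLOWED = {"i", "b", "span"}  # tags to keep; the parser lowercases tag names

class TagFilter(HTMLParser):
    """Collects text plus allowed tags, silently dropping all other tags."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through as written
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            # raw source text of the start tag, attributes included
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing form such as <span/>; the raw text already ends in />
        if tag in ALLOWED:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")  # re-emitted in lowercase

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def keep_only_allowed(html: str) -> str:
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

messy = '<p><B title="a > b">bold</b> &amp; <img alt="a > b"><i>it</i></p>'
print(keep_only_allowed(messy))
# <B title="a > b">bold</b> &amp; <i>it</i>
```

Unlike the regex, this copes with the unescaped > inside attribute values and with mixed-case tag names, because the parser tracks quoting and normalizes names itself.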