Regex match not this or that

Question

I'm trying to remove all tags other than italics, bold, or span tags, and can't seem to get it to work.

Currently, I have:

/[^i|b|span]/g

I understand that [] defines a character class, and that [span] will match s, p, a, or n rather than the whole word.

So my question is how to state: not: "tag1" or "tag2"?

EDIT I found the 'duplicate' question earlier, and it did not solve my issue.

`[^i|b|span]` is a regex that matches any character other than 'a', 'b', 'i', 'n', 'p', 's' or '|'. — bipll, Apr 11 '16 at 13:04
@RadLexus - It works, but i don't recognize parts of it, could you explain? I know the negative look-around, but not the rest. — Karric, Apr 11 '16 at 17:46
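
To make bipll's comment concrete, here is a minimal sketch. Python's re module is an assumption chosen for illustration (the thread does not name a regex flavor), and the sample string is made up. Removing everything that matches the character class leaves only the characters i, b, s, p, a, n and |, whatever tags they came from.

```python
import re

# The pattern from the question: a negated character class, not a list of words.
char_class = re.compile(r"[^i|b|span]")

# Hypothetical sample input, for illustration only.
sample = "<div>bold pins</div>"

# Every character that is NOT one of i, b, s, p, a, n or | gets removed.
result = char_class.sub("", sample)
print(result)
```

The output is a jumble of surviving letters, which shows that the class works per character and knows nothing about tag names.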

score 2 · Accepted Answer · edited May 23 '17 at 11:59

This ought to work, at least on fairly tidy HTML:

</?\s*(?!(i|b|span)\b)\w+[^>]*>

A blow-by-blow explanation (courtesy of http://rick.measham.id.au/paste/explain.pl):

NODE                   EXPLANATION
 <                     literal '<'
/?                     '/' (optional)
\s*                    any whitespace (\n, \r, \t, \f, and " ") (0 or
                       more times (matching the most amount
                       possible))
(?!                    look ahead to see if there is not:
  (                      start of OR'ed group
    i                        'i'
   |                        OR
    b                        'b'
   |                        OR
    span                     'span'
  )                      end of the OR'ed group
  \b                     the boundary between a word char (\w)
                         and something that is not a word char
)                      end of look-ahead
\w+                    word characters (a-z, A-Z, 0-9, _) (1 or
                       more times (matching the most amount
                       possible))
[^>]*                  any character except: '>' (0 or more times
                       (matching the most amount possible))
>                      literal '>'
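
The same breakdown can be kept next to the pattern itself. The following is a sketch that assumes Python's re module and its re.VERBOSE flag (an illustration choice, not something the answer specifies); the name TAG_PATTERN is hypothetical. The pattern is unchanged, only spread over several lines with comments.

```python
import re

# Same pattern as above, written with re.VERBOSE so each node carries its comment.
TAG_PATTERN = re.compile(r"""
    <                 # start of a tag
    /?                # optional slash (closing tag)
    \s*               # optional whitespace
    (?!               # negative lookahead: the name must not be ...
        (i|b|span)    #   ... one of the tags to keep
        \b            #   ... as a whole word
    )
    \w+               # the tag name
    [^>]*             # anything up to the closing bracket (attributes)
    >                 # end of the tag
""", re.VERBOSE)

print(TAG_PATTERN.sub("", "<p><b>x</b></p>"))
```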

Now what does this do in English?

It

looks for the start of any tag <
matches an optional tag end / because you want to find both opening and closing tags (<body> and </body>)
skips any amount of whitespace (which is allowed here and, come to think of it, in several other places, so if necessary, add to taste)
starts the negative lookahead. This is what Wiktor Stribiżew referred to, and it is explained in depth in "Regular expression to match a line that doesn't contain a word?".
lists, inside the lookahead, the OR'ed tag names that must not match. I added parentheses around them to group them because ...
there are other tags that start with b and i! The parentheses, followed by the \b, make sure that only 'whole words' from the OR list count.
the following \w+ matches the name of whatever tag comes next (which, may I remind you, cannot be i, b, or span per the negative lookahead).
But HTML tags do not end there! (At least, opening tags don't.) After the tag name itself, any number of attributes may appear. There is a rule, observed casually by most HTML editors and software, that the character > may not appear inside such an attribute; it should be encoded as &gt;. So to match anything up to the very end of this tag, skip anything that is not >.
... closed by a final >, to match the end.
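
The steps above can be tried out with a short sketch. It assumes Python's re module (the thread does not say which language is in use); strip_other_tags and the sample string are hypothetical names made up for this illustration.

```python
import re

# The pattern from the answer, unchanged.
TAG_TO_REMOVE = re.compile(r"</?\s*(?!(i|b|span)\b)\w+[^>]*>")

def strip_other_tags(html):
    """Remove every tag except <i>, <b> and <span>; text content is left alone."""
    return TAG_TO_REMOVE.sub("", html)

# Made-up sample: <body>, <p> and <br> should go, the other tags should stay.
sample = '<body><p>Keep <i>this</i>, <b>this</b> and <span class="x">this</span>.</p><br></body>'
print(strip_other_tags(sample))
```

Note that <body> and <br> are removed even though they start with b; that is the job of the \b inside the lookahead.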

Why the warning for 'fairly tidy HTML' at the top? Because even though HTML is described in excruciating detail, neither software nor (alas) humans who manually enter HTML observe all those pesky rules. A few possible problems that can occur with this regex:

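Self-closing tags. <br /> is caught, but only because [^>]* happens to swallow the trailing " /"; the pattern has no explicit handling for the self-closing slash.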
Unescaped > in attribute values. <img title="a > b"> will make it choke – the <img part and the first half of the title will be removed, but the second part and the final > character will remain.
Random capitalization. HTML is indifferent to capitalization in tags, and you can open with <B> and close with </b>, but regexes are usually case sensitive by default. Your regex flavor may have an Ignore Case flag; if not, you need to add the capitalized characters as well.
Blatantly malformed HTML. (There is no cure for that.)
Probably countless others.
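
A few of these cases can be checked directly. This is a sketch assuming Python's re module, with made-up inputs; remove_tags is a hypothetical helper. On these inputs <br /> does get matched (the " /" is consumed by [^>]*), the unescaped > leaves debris behind, and <B> is wrongly removed unless the ignore-case flag is set.

```python
import re

PATTERN = r"</?\s*(?!(i|b|span)\b)\w+[^>]*>"

def remove_tags(html, flags=0):
    """Apply the answer's pattern; flags is passed through to re.sub."""
    return re.sub(PATTERN, "", html, flags=flags)

# Self-closing tag: matched, because [^>]* also eats the " /".
self_closing = remove_tags("a<br />b")

# Unescaped '>' inside an attribute: the match stops at the first '>'.
unescaped = remove_tags('<img title="a > b">')

# Capitalization: <B> is not recognized as 'b' and gets removed ...
mixed_case = remove_tags("<B>x</b>")

# ... unless matching is made case-insensitive.
mixed_case_ok = remove_tags("<B>x</b>", flags=re.IGNORECASE)

print(repr(self_closing), repr(unescaped), repr(mixed_case), repr(mixed_case_ok))
```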

The best remedy is to ensure the HTML that goes "in" is already as clean as possible. You can use common tools such as HTMLTidy to preprocess your file. Better yet: do not attempt to make "RegEx match open tags except XHTML self-contained tags". (Paste the quoted text into any search engine for some fun.) A far superior solution is to use an HTML parser and simply kick out the tags you don't like. If your HTML is actually (properly formed) XHTML, this can also be done with XSLT, the general-purpose XML transformation language.
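
As a rough illustration of the parser route, here is a minimal sketch built on Python's standard html.parser module. The choice of Python and of this particular parser is an assumption, and TagFilter, ALLOWED and keep_allowed_tags are hypothetical names. It is not a sanitizer: it only drops the tags themselves and passes text through.

```python
from html.parser import HTMLParser

ALLOWED = {"i", "b", "span"}  # tags to keep; the parser reports names in lower case

class TagFilter(HTMLParser):
    """Re-emit the input, dropping every tag whose name is not in ALLOWED."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            # Original source text of the tag, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            # Closing tags are re-emitted in lower case.
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def keep_allowed_tags(html):
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(keep_allowed_tags('<p>Hi <B>bold</b> and <img title="a > b"> done</p>'))
```

Unlike the regex, this handles the capitalized <B> and the quoted > in the title attribute without special flags.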
