
I have a white list of HTML end tags (br, b, i, div):-

String whitelist = "([^br|^b|^i|^div])";
String endTagPattern = "(<[ ]*/[ ]*)" + whitelist + "(>?).*?([^>]+>)";
...
html = html.replaceAll(endTagPattern, "[r]");

This takes my test String and removes the end tags of those not in the white list (replaced here by [r] for clarity):-

1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, 4. <div>div</div>, 5. <script lang='test'>script</script>
1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong[r], 4. <div>div</div>, 5. <script lang='test'>script[r]

If I add strong to this white list

String whitelist = "([^br|^b|^i|^div|^strong])";

Not only does it not match the strong end tag, it also stops matching that of the end script tag or any other for that matter.

My question is, why?

Daniel Larsson
Ross Drew
  • 1
    [This answer might be pertinent](http://stackoverflow.com/a/1732454/2071828). You have also not understood how regex works - the pattern `[^br|^b|^i|^div|^strong]` is a character group that matches **not** `b` or `r` or `|` or `d` or `i` etc... – Boris the Spider Jan 03 '14 at 15:08
  • I realise that parsing HTML in any complex way is painful if not impossible but it should be possible to remove tags here and there no? – Ross Drew Jan 03 '14 at 15:10
  • 1
    (1) Using regex on HTML is a really bad idea. You should use a parser instead. (2) It seems you are confusing [groups](http://www.regular-expressions.info/brackets.html) `(...)` and [character classes](http://www.regular-expressions.info/charclass.html) `[...]`. – Pshemo Jan 03 '14 at 15:11
  • 1
    The point is that no, it is not possible, because tags can feature in all sorts of places - in attribute values for example. Regex is simply not capable of breaking down full HTML. – Boris the Spider Jan 03 '14 at 15:12

2 Answers

4

The reason for this is that you are using a character class. Inside a character class, the order of characters does not really matter except if you're dealing with character ranges.

So, [^br|^b|^i|^div|^strong] will actually match any single character except these:

bridvstrong|^

[Note that | and ^ are there too].

You could have used [^bridvstrong|^] and it would behave the same way.
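To see this concretely, here is a minimal Java sketch that checks single characters against the class before and after adding strong, and against the simplified form. The names CharClassDemo and accepts are made up for illustration. It also suggests why `</script>` stopped matching: the whitelist group has to consume the one character right after `</`, which for script is `s`, and `s` is now excluded because it occurs in strong.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // The whitelist from the question, before "strong" was added.
    static final Pattern BEFORE = Pattern.compile("[^br|^b|^i|^div]");
    // The whitelist from the question, after "strong" was added.
    static final Pattern AFTER = Pattern.compile("[^br|^b|^i|^div|^strong]");
    // The same excluded characters, written without the repeated | and ^.
    static final Pattern SIMPLIFIED = Pattern.compile("[^bridvstrong|^]");

    // True if the whole (single-character) string is accepted by the class.
    static boolean accepts(Pattern p, String ch) {
        return p.matcher(ch).matches();
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"s", "b", "|", "^", "a"}) {
            System.out.println(ch + " -> before: " + accepts(BEFORE, ch)
                    + ", after: " + accepts(AFTER, ch)
                    + ", simplified: " + accepts(SIMPLIFIED, ch));
        }
    }
}
```

Only "s" changes between the two whitelists: it is accepted before strong is added and rejected afterwards, so any end tag whose name starts with s, t, o, n or g stops matching as well.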

You might instead look into negative lookaheads.

Jerry
  • 2
    Ah phooey, thought I'd get around doing some importing with some simple regex. I was wrong, I've learned my lesson. No more HTML and Regexing for me. – Ross Drew Jan 03 '14 at 15:16
1
String whitelist = "([^br|^b|^i|^div])";

Using [] creates a character class. I presume you wrote this so you could use ^ for "not", but a character class is inappropriate here. Inside square brackets, | does not mean "or"; it's just a literal pipe character. And writing div doesn't match the word div; it matches any one of the three characters d, i, or v. Negating that means "match anything except d, i, or v".

That whitelist is effectively equivalent to [^bdirv|\^] — it matches a single character that is not b, d, i, r, v, |, or ^.

String whitelist = "(?!br|b|i|div)";

If you want to exclude certain matches, what you want is negative lookahead. Leaving out the square brackets lets you use | the way you intended, as an "or" operator.
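As a rough sketch of how a lookahead could slot into the end-tag pattern, something like the following might work for the test string in the question. The names EndTagStripper and markEndTags are hypothetical, and two details are additions of mine rather than part of the snippet above: a `\b` after the alternation, so that `b` does not also exclude names that merely start with b (such as blockquote), and a `\s*` inside the lookahead, so that whitespace after the slash cannot be used to sidestep the check. Case differences (`</B>`) are not handled, and this is still not a substitute for a real HTML parser.

```java
class EndTagStripper {
    // Replaces every end tag whose name is NOT in `allowed` with "[r]".
    // Assumes the allowed names are plain tag names with no regex metacharacters.
    static String markEndTags(String html, String... allowed) {
        // Negative lookahead: fail if optional whitespace plus a whole allowed name follows.
        String lookahead = "(?!\\s*(?:" + String.join("|", allowed) + ")\\b)";
        // "<", optional spaces, "/", the lookahead, then everything up to the closing ">".
        String endTag = "<\\s*/" + lookahead + "[^>]*>";
        return html.replaceAll(endTag, "[r]");
    }

    public static void main(String[] args) {
        String html = "1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, "
                + "4. <div>div</div>, 5. <script lang='test'>script</script>";
        // Without "strong" in the list, </strong> and </script> are both replaced.
        System.out.println(markEndTags(html, "br", "b", "i", "div"));
        // With "strong" added, only </script> is replaced.
        System.out.println(markEndTags(html, "br", "b", "i", "div", "strong"));
    }
}
```

Because the lookahead consumes nothing, the tag name itself is matched by the `[^>]*` that follows it, so adding another name to the list only affects end tags with exactly that name.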

John Kugelman