I wrote a regex that should match all dangerous HTML characters except those inside <span style="background-color: #any-color"> and </span>:

((?!<span style="background-color: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})">|<\/span>)[&<>"'/])

However, it still matches characters inside the tags I tried to exclude.
For example, the regex should not match the quotation mark in style="background-color:, but it does.

Where did I make a mistake?

See the Regex101 demo. Here is the code from the current project:

function escapeHtml(in_) {
    return in_.replace(/((?!<span style="background-color: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})">|<\/span>)[&<>"'/])/g, s => {
        const entityMap = {
            '&': '&amp;',
            '<': '&lt;',
            '>': '&gt;',
            '"': '&quot;',
            '\'': '&#39;',
            '/': '&#x2F;',
        };

        return entityMap[s];
    });
}
Wiktor Stribiżew
saber-nyan
  • Are you going to remove those "dangerous HTML characters"? – Wiktor Stribiżew Nov 26 '19 at 08:51
  • @WiktorStribiżew No, I'm going to escape these characters. I know that after such an operation it is still dangerous to insert them anywhere except the body of the tag, but there I am going to insert the resulting string. – saber-nyan Nov 26 '19 at 08:54
  • This is not quite possible in a regular text editor without hacks. [This is a PCRE regex solution](https://regex101.com/r/FGCm2j/1). – Wiktor Stribiżew Nov 26 '19 at 08:59
  • @WiktorStribiżew I'm going to use this regex in javascript, something like this: https://gist.github.com/saber-nyan/fed9fb8057912b0a3254baf9aa14022c – saber-nyan Nov 26 '19 at 09:02
  • [You should not use regular expressions to parse HTML](https://stackoverflow.com/a/1732454/3670132). If you are going to use JavaScript, why not use the built-in functions? – Seblor Nov 26 '19 at 09:04
  • @Seblor The question is not relevant to HTML parsing. – saber-nyan Nov 26 '19 at 09:06

1 Answer

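Note that you may only use a regex when you have full control over the markup that appears in the plain text string.

So, if you are the one inserting strings like </span> and <span style="background-color: #aaff11">, you may fix your code by matching those tags first in a capturing group and returning them unchanged, while escaping any other dangerous character: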

function escapeHtml(in_) {
 return in_.replace(/(<span style="background-color: #(?:[A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})">|<\/span>)|[&<>"'\/]/g, ($0,$1) => {
  const entityMap = {
   '&': '&amp;',
   '<': '&lt;',
   '>': '&gt;',
   '"': '&quot;',
   '\'': '&#39;',
   '/': '&#x2F;',
  };
  return $1 ? $1 : entityMap[$0];
 });
}
console.log(escapeHtml('<b>some test <span style="background-color: #333300">ol string!</b></span> nope <i>whoops</i> <span style="background-color: #ff0000">meh</span>'));
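To see why the original pattern leaks and why the alternation does not, here is a minimal sketch that compares what each pattern matches on one short string. The names `sample`, `original`, `fixed`, `matchedByOriginal` and `escapedByFixed` are made up for this illustration; the two patterns are copied from the question and from the code above.

```javascript
// Sample input: one allowed span wrapping text that contains "&".
const sample = '<span style="background-color: #ff0000">a & b</span>';

// Pattern from the question. The negative lookahead is checked only at the
// position of the character being tested, so it guards just the "<" that
// starts an allowed tag, not the characters inside the tag.
const original = /((?!<span style="background-color: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})">|<\/span>)[&<>"'/])/g;
const matchedByOriginal = sample.match(original);
console.log(matchedByOriginal); // [ '"', '"', '>', '&', '/', '>' ]

// Pattern from the answer. The first alternative consumes a whole allowed tag
// into group 1, so the scan resumes after the tag and never visits the
// characters inside it. Only matches with group 1 unset get escaped.
const fixed = /(<span style="background-color: #(?:[A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})">|<\/span>)|[&<>"'\/]/g;
const escapedByFixed = [...sample.matchAll(fixed)]
    .filter(m => m[1] === undefined) // group 1 unset: a lone dangerous character
    .map(m => m[0]);
console.log(escapedByFixed); // [ '&' ]
```

The key difference is consumption: a lookahead asserts without consuming, so the quotes and slash inside the tags are still tested one by one, whereas the alternation swallows each allowed tag in a single match.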

Otherwise, you need to consider a DOM parsing approach. See Parse an HTML string with JS.

Wiktor Stribiżew