How to escape all HTML in a string except <a>?

Question

I am making a chat app, and I want to escape all HTML sent to the event (except <a> tags, because links are auto-converted to HTML).

This is my escape function:

const escapeHtml = (unsafe) => {
    return unsafe.replaceAll('<', '&lt;').replaceAll('>', '&gt;');
};

and this is the element's HTML (that is shown to the user/client):

message.innerHTML = escapeHtml(json.username) + ":<br/>" + escapeHtml(json.message);

No, both portions of the code are from my front-end index.html file. — L8R, Oct 25 '22 at 01:49
@CertainPerformance I am not trying to replace the element, just escape all HTML tags in a string except `<a>` :) — L8R, Oct 25 '22 at 01:58
@CertainPerformance I do not believe that you need to use DOMParser. If OP has access to the DOM API (which can also be done from node.js, albeit with an NPM package), then one can easily sanitize HTML. From there it should just be a regex. Please see my answer. — Michael M., Oct 25 '22 at 02:05
Does this answer your question? [Escape html tags except few given tags](https://stackoverflow.com/questions/26106855/escape-html-tags-except-few-given-tags) — Heretic Monkey, Oct 25 '22 at 03:13

Kaiido · Accepted Answer · 2022-10-25T23:03:22.540

Note: I know this will receive some hate from future readers who came with the same question, but hear me out:

You're probably better off not doing that.

Escaping HTML correctly is already hard in itself. HTML is a very complex language with many edge cases and is far from being "regular".
There are good tools and great solutions to escape HTML, but these only work when escaping the whole input. Trying to modify one of these solutions so that it includes your special case will inevitably bring security risks. For instance, at the time I write this answer, two other answers have been posted, and I could perform an XSS attack on both of them in less than a few minutes. I'm not even a security researcher, just someone who knows HTML.
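For reference, escaping the whole input is the easy case. The following is a minimal sketch, assuming the escaped text only ever ends up in element content or inside a quoted attribute value. The name escapeAllHtml is hypothetical and does not come from any library:

```javascript
// Characters that carry meaning in HTML text and quoted attribute values.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: escapes every special character, with no exceptions.
function escapeAllHtml(unsafe) {
  return String(unsafe).replace(/[&<>"']/g, (char) => HTML_ESCAPES[char]);
}

console.log(escapeAllHtml('<a href="x">Tom & Jerry</a>'));
```

Because nothing is exempted, there is no tag or attribute left for the browser to interpret. The trouble starts as soon as one tag has to survive.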

Not only does a simple tag sanitizer become complex and easy to mess up, but even if you were able to make a correct one, it would need to handle attributes too (changing < to &lt; isn't enough), because you now have HTML in your page and (some) attributes can execute scripts:

<a href="#" onmouseover="alert('this also needs sanitization')" style="color:red">hover me</a>

And don't even think of outsmarting potential attackers, they'll always find something you didn't think of:

<button>Safe button</button>
<!-- Even the 'style' attribute needs to be sanitized, e.g. one could force all your users to follow their link: -->
<a href="https://example.com" style="position:fixed;top:0;left:0;width:100vw;height:100vw"></a>

So, tweaking a sanitizer isn't the way to go, nor is writing one yourself. Then what?

What you want is to let your users write some text, and append links in there. No need for HTML to do that. You can have a totally different markup language that will define that a given sequence should be treated as a link, but that won't understand any of HTML, and moreover, that any HTML parser won't understand as being HTML.

For instance CommonMark, which is used on this very website, on GitHub, and in many other places, does just that. It defines that the sequence [word](https://example.com) creates an anchor with the text "word", and we can store this while escaping any HTML we want without any risk. Even better, you don't need to escape the content at all, because you can now entirely avoid dangerous methods like setting .innerHTML and stick to the safe .textContent.
But you shouldn't even have to worry about that, because there are many well-written CommonMark parsers and renderers that will generate just what you need directly.
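To make the idea concrete, here is a rough sketch that handles only the [label](url) syntax and builds the DOM with createElement and textContent. It is not a CommonMark implementation. The names parseLinks, isSafeUrl and renderMessage are hypothetical, and restricting links to http and https is an assumption of this sketch, not something the question asked for:

```javascript
// Matches [label](url); the url may not contain whitespace or ')'.
const LINK_PATTERN = /\[([^\]]+)\]\(([^)\s]+)\)/g;

// Assumption: only absolute http(s) URLs are allowed to become links.
function isSafeUrl(candidate) {
  try {
    const { protocol } = new URL(candidate);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // not a parseable absolute URL
  }
}

// Splits a message into plain-text and link segments. No HTML is produced.
function parseLinks(message) {
  const segments = [];
  let lastIndex = 0;
  for (const match of message.matchAll(LINK_PATTERN)) {
    const [whole, label, href] = match;
    if (!isSafeUrl(href)) continue; // rejected links stay literal text
    if (match.index > lastIndex) {
      segments.push({ type: 'text', value: message.slice(lastIndex, match.index) });
    }
    segments.push({ type: 'link', label, href });
    lastIndex = match.index + whole.length;
  }
  if (lastIndex < message.length) {
    segments.push({ type: 'text', value: message.slice(lastIndex) });
  }
  return segments;
}

// Browser-only part: every piece of user text goes through textContent
// or a text node, so it is never parsed as HTML.
function renderMessage(container, message) {
  for (const segment of parseLinks(message)) {
    if (segment.type === 'link') {
      const anchor = document.createElement('a');
      anchor.href = segment.href;
      anchor.textContent = segment.label;
      anchor.rel = 'noopener noreferrer';
      container.append(anchor);
    } else {
      container.append(document.createTextNode(segment.value));
    }
  }
}
```

With this shape, a message such as `<b>hi</b> see [docs](https://example.com)` shows the `<b>` tags as literal characters and only the bracketed part becomes a link. For anything beyond this single syntax, an existing CommonMark library is the safer choice.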

I don't want markdown tho because the users don't know what markdown is

And they know how HTML works? The best approach is to give them a good UI where they just have to push a button that opens a prompt where they can enter the address the link should lead to, so that their currently selected text becomes the proper markdown, like most online text editors do.

I don't want markdown tho because the users don't know what markdown is — L8R, Oct 25 '22 at 11:45
And they know how HTML works? Simply provide them with a good UI where they just have to push a button which opens a prompt where they can enter the address the link should lead to, so that their currently selected text becomes the proper markdown. I mean, like any online text editor does. — Kaiido, Oct 25 '22 at 12:07

slebetman · Answer 2 · 2022-10-25T13:30:58.117

The .replaceAll() function accepts a function as the replacement argument if you need to do more advanced processing. You can use that function to decide if you want to do the replacement or not:

unsafe.replaceAll(/<([^\s>]+)(.*?)>/g,(match, tag, remainder) => {
    if (tag === 'a' || tag === '/a') {
        return `<${tag}${remainder}>`;
    }
    else {
        return `&lt;${tag}${remainder}&gt;`;
    }
});

You can read about how to use functions as replacement from the docs: Specifying functions as the replacement.

Explanation of the regular expression:

The regexp is just looking for all tags enclosed by < and >:

<        // starts with <
(        // remember this matching group (function 2nd argument)
  [^     // anything that is not
    \s   // whitespace
    >    // or >
  ]
  +      // one or more of the above
)
(        // remember this matching group (function 3rd argument)
  .      // any character
  *      // zero or more of the above
  ?      // don't be greedy
)
>        // ends with >

Note that this regular expression expects all HTML tags to immediately start with the tag name (e.g. <a>). It breaks if you have tags that have whitespace before the tag name, for example:

// the regexp above does not work if your string looks like this:

hello <
         a href="/world"
      > world </a>

To fix that you can add a zero or more whitespace pattern (\s*) right after <:

/<\s*([^\s>]+?)(.*?)>/g
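As a quick check of what the replacer does and does not cover, the sketch below wraps the first pattern from this answer in a function (the name escapeExceptAnchors is hypothetical) and runs it on two inputs. It only illustrates behaviour and should not be read as a safe sanitizer:

```javascript
// The first pattern from this answer, wrapped in a hypothetical function.
function escapeExceptAnchors(unsafe) {
  return unsafe.replaceAll(/<([^\s>]+)(.*?)>/g, (match, tag, remainder) => {
    if (tag === 'a' || tag === '/a') {
      return match; // anchor tags pass through untouched, attributes included
    }
    return `&lt;${tag}${remainder}&gt;`;
  });
}

const bold = escapeExceptAnchors('<b>hi</b>');
const hostile = escapeExceptAnchors('<a href="#" onmouseover="alert(1)">x</a>');

console.log(bold);    // the <b> tags come out escaped
console.log(hostile); // identical to the input, onmouseover handler and all
```

Non-anchor tags are escaped, but every attribute on an <a> tag survives, including event handlers, so the output would still need attribute-level sanitization before being assigned to .innerHTML.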

"Note that this regular expression expects all HTML tags to immediately start with the tag name" This is not an issue, HTML has the same expectation. However your RegExp does capture only the first letter of the tag name, you'd need to remove the `?` from it, but I can't assure that'd make it *safe*. https://jsfiddle.net/w4hr107m/ — Kaiido, Oct 25 '22 at 03:18
And indeed it's still not *safe*: https://jsfiddle.net/ptw1hjno/ — Kaiido, Oct 25 '22 at 09:48
