Note: I know this will draw some hate from future readers who came here with the same question, but hear me out:
You're probably better off not doing that.
Escaping HTML correctly is already hard on its own. HTML is a very complex language with many edge cases and is far from being "regular".
There are good tools and great solutions to escape HTML, but they only work when escaping the whole input. Trying to modify one of these solutions so that it handles your special case will inevitably bring security risks. For instance, at the time I write this answer, two other answers have been posted, and I could perform an XSS attack on both of them within a few minutes. I'm not even a security researcher, just someone who knows HTML.
Not only does a simple tag sanitizer become complex and easy to get wrong, but even if you managed to write a correct one, it would need to handle attributes too (changing < to &lt; isn't enough), because you now have real HTML in your page and some attributes can execute scripts:
<a href="#" onmouseover="alert('this also needs sanitization')" style="color:red">hover me</a>
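To make that concrete, here is a deliberately broken sketch of the "escape everything, then let anchors back in" approach. The function names (escapeHtml, naiveAllowAnchors) are hypothetical and not taken from any of the other answers; they only illustrate the kind of tweak that goes wrong.

```javascript
// Escape the characters that matter in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Counter-example, do NOT use: escape everything, then naively restore
// anything that looks like an <a ...> or </a> tag.
function naiveAllowAnchors(text) {
  return escapeHtml(text)
    .replace(/&lt;a (.*?)&gt;/g, (_, attrs) =>
      "<a " + attrs.replace(/&quot;/g, '"') + ">")
    .replace(/&lt;\/a&gt;/g, "</a>");
}

const payload = '<a href="#" onmouseover="alert(1)">hover me</a>';
const output = naiveAllowAnchors(payload);

// Other tags stay escaped, so the sanitizer looks like it works...
console.log(naiveAllowAnchors("<script>alert(1)</script>"));
// ...but the event handler attribute comes back untouched.
console.log(output);
```

The anchor is restored with its onmouseover handler intact, so the page would run the attacker's script as soon as someone hovers the link. Patching this one case (say, by stripping on* attributes) just moves you on to the next hole.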
And don't even think of outsmarting potential attackers; they'll always find something you didn't think of:
<button>Safe button</button>
<!-- Even the 'style' attribute needs to be sanitized, e.g. one could force all your users to follow their link: -->
<a href="https://example.com" style="position:fixed;top:0;left:0;width:100vw;height:100vw"></a>
So, tweaking a sanitizer isn't the way to go, nor is writing one yourself. Then what?
What you want is to let your users write some text and add links to it. You don't need HTML for that. You can use a completely different markup language, one that defines that a given sequence should be treated as a link, but that doesn't understand any HTML and, more importantly, that no HTML parser will understand as HTML.
For instance CommonMark, which is used on this very website, on GitHub and in many other places, does just that. It defines that the sequence [word](https://example.com) creates an anchor whose text is "word", and we can store and display the rest of the input with every bit of HTML escaped, without any risk. Even better, you don't need to escape the content at all, because you can now avoid dangerous methods like setting .innerHTML entirely and stick to the safe .textContent.
But you shouldn't even worry about that either, because there are many well-written CommonMark parsers and renderers that will generate just what you need directly.
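If you are curious what the core idea looks like, here is a minimal sketch that only understands the [label](url) form and builds DOM nodes through .textContent. It is a toy subset and not a CommonMark implementation; tokenizeLinks, isSafeUrl and renderInto are names I made up for the illustration, and restricting links to http/https is my own assumption about what you'd want to allow.

```javascript
// Accept only absolute http(s) URLs; this rejects javascript: and friends.
function isSafeUrl(url) {
  try {
    const { protocol } = new URL(url);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // relative or malformed URLs are rejected in this sketch
  }
}

// Split the source into plain-text tokens and link tokens.
function tokenizeLinks(source) {
  const pattern = /\[([^\]]+)\]\(([^)\s]+)\)/g;
  const tokens = [];
  let last = 0;
  for (const match of source.matchAll(pattern)) {
    const [whole, label, url] = match;
    if (match.index > last) {
      tokens.push({ type: "text", value: source.slice(last, match.index) });
    }
    if (isSafeUrl(url)) {
      tokens.push({ type: "link", label, url });
    } else {
      // Unsafe link: keep the raw characters as inert text.
      tokens.push({ type: "text", value: whole });
    }
    last = match.index + whole.length;
  }
  if (last < source.length) {
    tokens.push({ type: "text", value: source.slice(last) });
  }
  return tokens;
}

// Browser-only part: user input is never parsed as HTML.
function renderInto(container, source) {
  for (const token of tokenizeLinks(source)) {
    if (token.type === "link") {
      const a = document.createElement("a");
      a.href = token.url;
      a.textContent = token.label;
      container.append(a);
    } else {
      container.append(document.createTextNode(token.value));
    }
  }
}
```

Since everything the user typed ends up either in a text node or in .textContent, something like <img src=x onerror=alert(1)> is displayed as literal characters instead of being interpreted. For anything beyond this toy case, use an existing CommonMark library rather than growing the sketch.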
"I don't want markdown tho because the users don't know what markdown is"
And they know how HTML works? The best option is to give them a good UI: a button that opens a prompt asking for the address the link should lead to, and that turns their currently selected text into the proper markdown, like most online text editors do.
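Such a button doesn't take much. Below is a rough sketch, assuming the text lives in a textarea; wrapSelectionAsLink and attachLinkButton are hypothetical names, and the prompt wording is made up.

```javascript
// Pure helper: wrap the range [start, end) of `text` in [label](url) syntax.
// Returns the text unchanged when nothing is selected or no URL is given.
function wrapSelectionAsLink(text, start, end, url) {
  const label = text.slice(start, end);
  if (label === "" || url === "") return text;
  return text.slice(0, start) + "[" + label + "](" + url + ")" + text.slice(end);
}

// Browser-only wiring for a "link" button next to a <textarea>.
function attachLinkButton(button, textarea) {
  button.addEventListener("click", () => {
    const url = prompt("Link address:");
    if (!url) return; // user cancelled or entered nothing
    textarea.value = wrapSelectionAsLink(
      textarea.value,
      textarea.selectionStart,
      textarea.selectionEnd,
      url.trim()
    );
  });
}
```

For example, wrapSelectionAsLink("read the docs", 9, 13, "https://example.com") gives "read the [docs](https://example.com)". The users never have to learn the syntax, and whatever they type is still validated when the markdown is rendered.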