Decoding special characters into HTML characters

Question

I have been searching the web for a JavaScript function that turns something like this:

Hi.&amp;nbsp; I will show &lt;span style="font-weight: bold;"&gt;HTML&lt;/span&gt;.

Into this:

Hi.&nbsp; I will show <span style="font-weight: bold;">HTML</span>.

I am using this method:

htmlDecode: function (input) {
    var doc = new DOMParser().parseFromString(input, "text/html");
    return doc.documentElement.textContent;
}

And it works the first time. But if I try it again, on text like this:

Hi.&nbsp; I will show <span style="font-weight: bold;">HTML</span>

It strips out all the html and just leaves me with:

Hi. I will show HTML.

I only want the method to change this:

Hi.&amp;nbsp; I will show &lt;span style="font-weight: bold;"&gt;HTML&lt;/span&gt;.

Into this:

Hi.&nbsp; I will show <span style="font-weight: bold;">HTML</span>.

I don't want it to totally strip out the HTML.

Is there a way to do that?

Thanks!

Does this answer your question? [Unescape HTML entities in Javascript?](https://stackoverflow.com/questions/1912501/unescape-html-entities-in-javascript) — shreyasm-dev, Sep 25 '20 at 16:53
Why then do you call it a second time on the result of the first call? Just call it once only? — trincot, Sep 25 '20 at 16:53
@trincot I know, but this function is fired whenever text is saved to the backend, so I guess I thought it would be OK on everything. But how would I tell it not to fire on text that is already OK? Thanks! – SkyeBoniwell, Sep 25 '20 at 17:01

trincot · Accepted Answer · 2020-09-25T17:51:21.440

You could check if the parsed result contains DOM elements. If so, then it means the decoding went one step too far, and the original value should be returned:

function htmlDecode (input) {
    let doc = new DOMParser().parseFromString(input, "text/html");
    let body = doc.querySelector("body");
    return body.children.length ? input : body.textContent;
}

let s = 'Hi.&amp;nbsp; I will show &lt;span style="font-weight: bold;"&gt;HTML&lt;/span&gt;.';
s = htmlDecode(s);
console.log(s); // decoded
s = htmlDecode(s); // apply on the result...
console.log(s); // ... no change

s = htmlDecode("Hi.&nbsp; This is normal text.");
console.log(s);
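
Where `DOMParser` is not available (a Node.js script, for instance), the same guard can be approximated with plain string handling. What follows is only a sketch and not the code from this answer: `decodeEntitiesOnce`, `htmlDecodeOnce`, the entity table and the tag regex are all names and choices assumed for illustration, and the table covers just a handful of entities rather than the full HTML list.

```javascript
// Assumed, deliberately small entity table; a real decoder needs the full list.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"],
  ["quot", '"'], ["apos", "'"], ["nbsp", "\u00a0"],
]);

// Resolve ONE level of entities: names from the table above plus numeric
// references such as &#160; or &#xA0;. Anything unknown is left untouched.
function decodeEntitiesOnce(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

// Rough stand-in for "does the parsed body have child elements?":
// anything shaped like <name ...> or </name> counts as markup.
const LOOKS_LIKE_TAG = /<\/?[a-z][^<>]*>/i;

// If the input already contains markup, decoding would go one step too far,
// so the input is returned unchanged.
function htmlDecodeOnce(input) {
  return LOOKS_LIKE_TAG.test(input) ? input : decodeEntitiesOnce(input);
}
```

A regex is a much cruder test than a real HTML parser, so text such as `a <b and c> d` would be mistaken for markup here; the `DOMParser` version above is the safer choice in a browser.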

An additional check

Another heuristic: assume that a value which really needs decoding is double-encoded, so that decoding the result of the first decoding would change it yet again. If the second decoding produces the same text as the first, the input was encoded at most once, and the original input should be returned.

function htmlDecode (input) {
    let parser = new DOMParser();
    let doc = parser.parseFromString(input, "text/html");
    let { textContent, children } = doc.querySelector("body");
    if (children.length) return input;
    doc = parser.parseFromString(textContent, "text/html");
    if (doc.querySelector("body").textContent === textContent) return input;
    return textContent;
}

let s = 'Hi.&amp;nbsp; I will show &lt;span style="font-weight: bold;"&gt;HTML&lt;/span&gt;.';
s = htmlDecode(s);
console.log(s); // decoded
s = htmlDecode(s); // apply on the result...
console.log(s); // ... no change

s = "Hi.&amp;nbsp; This is normal text.";
s = htmlDecode(s);
console.log(s); // decoded
s = htmlDecode(s); // apply on the result...
console.log(s); // ... no change
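
The two checks can also be tried outside the browser with a regex stand-in for "parse as HTML and read `body.textContent`". This is a sketch under stated assumptions, not a replacement for the function above: `textContentOf`, `htmlDecodeGuarded`, the five-entry entity table and the tag pattern are hypothetical and far less thorough than a real parser.

```javascript
// Assumed, deliberately tiny entity table.
const ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', nbsp: "\u00a0" };
const TAG = /<\/?[a-z][^<>]*>/i;        // rough "is there an element?" test
const ALL_TAGS = /<\/?[a-z][^<>]*>/gi;  // same pattern, used for stripping

// Stand-in for body.textContent: drop tags, then resolve entities once.
function textContentOf(html) {
  return html
    .replace(ALL_TAGS, "")
    .replace(/&(amp|lt|gt|quot|nbsp);/g, (_, name) => ENTITIES[name]);
}

function htmlDecodeGuarded(input) {
  // Check 1: the input already contains markup, so leave it alone.
  if (TAG.test(input)) return input;
  const decoded = textContentOf(input);
  // Check 2: if decoding the result again changes nothing, the input was
  // encoded at most once, so the original is kept.
  if (textContentOf(decoded) === decoded) return input;
  return decoded;
}

let t = "Hi.&amp;nbsp; This is normal text.";
t = htmlDecodeGuarded(t); // one level removed: "Hi.&nbsp; This is normal text."
t = htmlDecodeGuarded(t); // unchanged on the second call
```

Note the consequence of the second check: a string that is encoded only once and contains no markup after decoding, such as `Hi.&nbsp; This is normal text.`, is returned as it is.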

Hi! This is working a lot better, but I noticed if I have text like this: `Hi.&nbsp; This is normal text.`, and then I run that function above, there will be an `&amp` in front of the `&nbsp;` so that the text then looks like this: `Hi.&amp;nbsp; This is normal text.` – SkyeBoniwell, Sep 25 '20 at 17:22
I cannot reproduce that. If you pass "Hi.&nbsp; This is normal text." as argument to this function, the `&nbsp;` will be replaced with a non-breaking space, so that you get "Hi. This is normal text.". I have added this case to my snippet so you can see it run. – trincot, Sep 25 '20 at 17:24
But it is true that if you decode "Hi.&amp;nbsp; This is normal text." and then decode the result of that, it will still resolve entities further. The thing is: you cannot know whether the original string was really "Hi.&nbsp; This is normal text." or "Hi.&amp;nbsp; This is normal text.". Both could be true. Without further information, it is impossible to distinguish both cases. It could even be that the original text was "Hi.  This is normal text." ... etc – trincot, Sep 25 '20 at 17:35
I extended my answer with another heuristic, so to capture this case as well. — trincot, Sep 25 '20 at 17:51
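
The ambiguity discussed in these comments can be made concrete with a small experiment. The helpers below (`escapeHtml`, `unescapeHtml`, `decodeTagged`) and the `{ text, depth }` record shape are hypothetical and not part of the answer. The sketch shows that every escape level of a string is itself a valid string, and one possible way of carrying the "further information" along with the text.

```javascript
// Hypothetical helper: escape the characters that matter in HTML text.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

const UNESCAPE = { amp: "&", lt: "<", gt: ">" };

// Undo exactly one level. Single pass, so "&amp;lt;" becomes "&lt;", not "<".
function unescapeHtml(text) {
  return text.replace(/&(amp|lt|gt);/g, (_, name) => UNESCAPE[name]);
}

// One string escaped 0, 1, 2 and 3 times. Looking at any single entry,
// nothing tells a decoder how many levels it is supposed to remove.
const levels = ["Hi.&nbsp; This is normal text."];
for (let depth = 1; depth <= 3; depth++) {
  levels.push(escapeHtml(levels[depth - 1]));
}

// Assumed record shape { text, depth }: the escape depth travels with the
// text, so decoding is exact and repeating it is harmless.
function decodeTagged(record) {
  let { text, depth } = record;
  while (depth > 0) {
    text = unescapeHtml(text);
    depth -= 1;
  }
  return { text, depth: 0 };
}
```

With the depth stored next to the text, a save handler that fires repeatedly has nothing left to guess: a record at depth 0 is simply returned unchanged.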

score 0 · Answer 2 · answered Sep 25 '20 at 16:50

You can create a div, set its innerHTML, and then retrieve its innerText.

function htmlDecode(text) {
  var div = document.createElement('div')
  div.innerHTML = text
  return div.innerText
}

console.log(htmlDecode('Hi.&amp;nbsp; I will show &lt;span style="font-weight: bold;"&gt;HTML&lt;/span&gt;.'))

answered Sep 25 '20 at 16:50

shreyasm-dev

This will allow for arbitrary code execution. DOMParser is probably a better idea – CertainPerformance Sep 25 '20 at 16:52
@CertainPerformance thanks! That's exactly why I'm using DOMParser. :) – SkyeBoniwell Sep 25 '20 at 16:53
