25

How to decode HTML entities like `&nbsp;` `&#39;` to their original characters?

In browsers we can create a DOM element to do the trick (see here), or we can use a library like he.
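The DOM approach mentioned above is commonly done with a detached `textarea` element. Here is a minimal sketch of it. The name `decodeEntitiesWithDom` is made up for illustration, and returning `null` when no `document` exists is just one possible choice.

```javascript
// Hypothetical helper: lets the browser's own HTML parser do the decoding.
function decodeEntitiesWithDom(encodedString) {
    if (typeof document === "undefined") {
        return null; // no DOM available (e.g. plain Node.js)
    }
    var textarea = document.createElement("textarea");
    // A textarea's content is parsed as text plus character references,
    // so tags in the input are not turned into elements.
    textarea.innerHTML = encodedString;
    return textarea.value;
}
```

In a browser, `decodeEntitiesWithDom("&lt;p&gt;")` should give `<p>`. In plain Node.js it returns `null`, which is why a DOM-free approach is needed there.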

In Node.js we can use a third-party library like html-entities.

What if we want to use plain JavaScript to do the job?

There are many similar questions and useful answers on Stack Overflow, but I can't find an approach that works in both browsers and Node.js. So I'd like to share my own.

I have posted my approach as an answer below. I hope it helps someone. :)

Henry He
  • 3
    https://stackoverflow.com/questions/18749591/encode-html-entities-in-javascript This should really work. Ensure encoding is the same https://github.com/mathiasbynens/he – getjackx May 26 '17 at 06:51
  • The [he](https://github.com/mathiasbynens/he) package solves this by following the HTML spec instead of relying on a manually maintained dictionary. – Tobias Mühl Nov 02 '20 at 11:58

1 Answer

68

For HTML codes like `&nbsp;` `&lt;` `&gt;` `&#39;`, and even Chinese characters written as numeric references, I suggest using this function (inspired by some other answers).

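function decodeEntities(encodedString) {
    // Named entities handled here; this is only a small subset of the HTML list.
    var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    var translate = {
        "nbsp":" ",
        "amp" : "&",
        "quot": "\"",
        "lt"  : "<",
        "gt"  : ">"
    };
    // Pass 1 decodes named entities; pass 2 decodes decimal references such as &#39;.
    // Caveat: pass 2 rescans the output of pass 1, so "&amp;#60;" ends up as "<".
    return encodedString.replace(translate_re, function(match, entity) {
        return translate[entity];
    }).replace(/&#(\d+);/gi, function(match, numStr) {
        var num = parseInt(numStr, 10);
        // fromCharCode takes UTF-16 code units, so values above 65535 come out wrong.
        return String.fromCharCode(num);
    });
}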

This implementation also works in a Node.js environment.

decodeEntities("&#21704;&#21704;&nbsp;&#39;&#36825;&#20010;&#39;&amp;&quot;&#37027;&#20010;&quot;&#22909;&#29609;&lt;&gt;") //哈哈 '这个'&"那个"好玩<>
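Because `decodeEntities` chains two `replace` calls, the second call scans text that the first one has already decoded. It also handles only decimal references. Below is a minimal single-pass sketch that avoids the double decoding and adds hexadecimal references. The names `NAMED_ENTITIES` and `decodeEntitiesOnce` are made up for illustration, the table is deliberately tiny rather than the full HTML list, and leaving unknown or out-of-range references untouched is an assumption rather than what the HTML spec prescribes.

```javascript
// Sketch only: NAMED_ENTITIES is a deliberately tiny table, not the full HTML list.
var NAMED_ENTITIES = {
    nbsp: "\u00a0", // U+00A0, non-breaking space
    amp: "&",
    quot: "\"",
    apos: "'",
    lt: "<",
    gt: ">"
};

function decodeEntitiesOnce(encodedString) {
    // One alternation covers hex (&#x4e2d;), decimal (&#25991;) and named forms,
    // so decoded output is never scanned a second time.
    var entity_re = /&(?:#[xX]([0-9a-fA-F]+)|#(\d+)|([a-zA-Z][a-zA-Z0-9]*));/g;
    return encodedString.replace(entity_re, function (match, hex, dec, name) {
        if (name !== undefined) {
            // Unknown names are left untouched rather than turned into "undefined".
            return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
                ? NAMED_ENTITIES[name]
                : match;
        }
        var num = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // String.fromCodePoint throws a RangeError above U+10FFFF, so keep such input as-is.
        return num <= 0x10FFFF ? String.fromCodePoint(num) : match;
    });
}

console.log(decodeEntitiesOnce("&amp;#60;b&amp;#62;"));                // &#60;b&#62;
console.log(decodeEntitiesOnce("&#x4e2d;&#25991; &lt;&#128512;&gt;")); // 中文 <😀>
```

`String.fromCodePoint` is used instead of `String.fromCharCode` so that code points above U+FFFF (such as emoji) decode correctly. For full coverage of named entities, a spec-based library such as he is still the safer choice.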

As a new user, I only have 1 reputation :(

I can't comment on or answer existing posts, so this is the only thing I can do for now.

Edit 1

I think this answer works even better than mine, although no one has upvoted it.

Nick
Henry He
  • 5
    This will miss a lot of HTML entities, such as `&rdquo;` `&uuml;` `&scaron;` etc. The comprehensive list of all HTML entities is quite long: https://www.freeformatter.com/html-entities.html – lofihelsinki Dec 01 '20 at 11:59
  • 2
    This is incorrect. Since there are two `replace`s, "&amp;#41;" will be decoded as ")" and not as "&#41;" as it should be. I still upvoted, but only because the linked answer is correct. – Michael Schmidt Jun 29 '21 at 20:41
  • Worked for me to replace #8217; was looking for about 15 minutes, thanks – jawn Sep 16 '21 at 02:09
  • 4
    IMPORTANT! This is vulnerable to XSS attacks for user-inputted content. `&amp;#60;` is converted into `<` due to there being two separate `replace` calls. – Grant Gryczan Oct 14 '21 at 07:42
  • It also misses hexadecimal values, such as ` `. – Yogurt Feb 24 '23 at 15:19