A plain JavaScript way to decode HTML entities, works on both browsers and Node

Question

How can I decode HTML entities like `&nbsp;` and `&#39;` back to their original characters?

In browsers we can create a DOM element to do the trick (see here), or we can use a library such as he.

In Node.js we can use a third-party library such as html-entities.

What if we want to use plain JavaScript to do the job?

There are many similar questions and useful answers on Stack Overflow, but I couldn't find an approach that works in both browsers and Node.js, so I'd like to share mine.

I have posted it as an answer below. I hope it helps someone. :)

https://stackoverflow.com/questions/18749591/encode-html-entities-in-javascript This should really work. Ensure encoding is the same https://github.com/mathiasbynens/he — getjackx, May 26 '17 at 06:51
The [he](https://github.com/mathiasbynens/he) package solves this by following the HTML spec instead of relying on a manually maintained dictionary. — Tobias Mühl, Nov 02 '20 at 11:58

score 68 · Accepted Answer · edited May 26 '17 at 07:41

There are many similar questions and useful answers on Stack Overflow, but I couldn't find an approach that works in both browsers and Node.js, so I'd like to share mine.

It handles HTML entities like `&nbsp;`, `&lt;`, `&gt;`, `&#39;`, and even numeric references for Chinese characters.

I suggest using this function (inspired by some other answers):

function decodeEntities(encodedString) {
    // Only these five named entities are recognized; any other name is left as-is.
    var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    var translate = {
        "nbsp":" ",
        "amp" : "&",
        "quot": "\"",
        "lt"  : "<",
        "gt"  : ">"
    };
    return encodedString.replace(translate_re, function(match, entity) {
        // First pass: named entities from the table above.
        return translate[entity];
    }).replace(/&#(\d+);/gi, function(match, numStr) {
        // Second pass: decimal numeric references such as &#39;.
        var num = parseInt(numStr, 10);
        return String.fromCharCode(num);
    });
}

This implementation also works in a Node.js environment.

decodeEntities("&#21704;&#21704;&nbsp;&#39;&#36825;&#20010;&#39;&amp;&quot;&#37027;&#20010;&quot;&#22909;&#29609;&lt;&gt;") // 哈哈 '这个'&"那个"好玩<>

As a new user, I only have 1 reputation :(

I can't comment on or answer existing posts, so this is the only thing I can do for now.

Edit 1

I think this answer works even better than mine, although no one has upvoted it.

edited May 26 '17 at 07:41 by Nick

answered May 26 '17 at 07:20 by Henry He

score 5 · This will miss a lot of HTML entities, such as `&rdquo;`, `&uuml;`, `&scaron;`, etc. The comprehensive list of all HTML entities is quite long: https://www.freeformatter.com/html-entities.html – lofihelsinki Dec 01 '20 at 11:59
score 2 · This is incorrect. Since there are two `replace`s, `&amp;#65;` will be decoded as `A` and not as `&#65;` as it should be. I still upvoted, but because the linked answer is correct. – Michael Schmidt Jun 29 '21 at 20:41
Worked for me to replace #8217; was looking for about 15 minutes, thanks – jawn Sep 16 '21 at 02:09
score 4 · IMPORTANT! This is vulnerable to XSS attacks for user-inputted content. `&amp;#60;` is converted into `<` due to there being two separate `replace` calls. – Grant Gryczan Oct 14 '21 at 07:42
It also misses hexadecimal values, such as `&#x20;`. – Yogurt Feb 24 '23 at 15:19
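
The double-decoding and hexadecimal issues raised in the comments above could be addressed by matching every entity form in a single `replace` call, so that text produced by one replacement is never scanned again. The following is only a minimal sketch, not the answer author's code: the names `decodeEntitiesOnce` and `NAMED_ENTITIES` are hypothetical, the named-entity table is still tiny (it will miss most of the named entities mentioned above), and it assumes an environment with `String.fromCodePoint` (ES2015).

```javascript
// Hypothetical table for illustration; a spec-complete list is far longer.
var NAMED_ENTITIES = {
    nbsp: " ",   // same mapping as the answer above
    amp: "&",
    quot: "\"",
    lt: "<",
    gt: ">",
    apos: "'"
};

// Hypothetical helper: decodes each entity exactly once, in a single pass.
function decodeEntitiesOnce(encodedString) {
    // One alternation covers hex (&#x41;), decimal (&#65;) and named (&amp;) references.
    var entityRe = /&(?:#x([0-9a-f]+)|#(\d+)|([a-z][a-z0-9]*));/gi;
    return encodedString.replace(entityRe, function (match, hex, dec, name) {
        if (name !== undefined) {
            // Unknown names are returned unchanged instead of becoming "undefined".
            return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
                ? NAMED_ENTITIES[name]
                : match;
        }
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // String.fromCodePoint throws a RangeError above 0x10FFFF, so keep such input as-is.
        if (codePoint > 0x10FFFF) {
            return match;
        }
        return String.fromCodePoint(codePoint);
    });
}

console.log(decodeEntitiesOnce("&amp;#60;"));   // &#60;
console.log(decodeEntitiesOnce("&#x41;&#65;")); // AA
```

Because the callback's return value is not re-scanned, `&amp;#60;` stays as the literal text `&#60;` rather than turning into `<`. For anything beyond a handful of entities, a spec-following library such as he is probably the safer choice.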
