What's the right way to decode a string that has special HTML entities in it?

Question

Say I get some JSON back from a service request that looks like this:

{
    "message": "We&#39;re unable to complete your request at this time."
}

I'm not sure why that apostrophe is encoded like that (&#39;); all I know is that I want to decode it.

Here's one approach using jQuery that popped into my head:

function decodeHtml(html) {
    return $('<div>').html(html).text();
}

That seems (very) hacky, though. What's a better way? Is there a "right" way?

Possible dup of: http://stackoverflow.com/questions/5796718/html-entity-decode — jfriend00, Sep 12 '11 at 22:35

Rob W · Accepted Answer · 2014-03-04T18:55:04.417

553

This is my favourite way of decoding HTML characters. The advantage of using this code is that tags are also preserved.

function decodeHtml(html) {
    var txt = document.createElement("textarea");
    txt.innerHTML = html;
    return txt.value;
}

Example: http://jsfiddle.net/k65s3/

Input:

Entity:&nbsp;Bad attempt at XSS:<script>alert('new\nline?')</script><br>

Output:

Entity: Bad attempt at XSS:<script>alert('new\nline?')</script><br>
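The textarea technique needs a `document`, which is not available everywhere (Node.js, for instance). One possible arrangement, shown here only as a minimal sketch, is to use the textarea when a DOM is present and otherwise fall back to decoding decimal numeric references. The name `decodeHtmlSafe` and the fallback behaviour are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: uses the textarea trick when a DOM is available,
// otherwise decodes only decimal numeric references such as &#39;.
function decodeHtmlSafe(html) {
  if (typeof document !== "undefined") {
    var txt = document.createElement("textarea");
    txt.innerHTML = html;
    return txt.value;
  }
  // Assumption: without a DOM, named entities are left untouched.
  return html.replace(/&#(\d+);/g, function (match, dec) {
    var code = Number(dec);
    // Leave references outside the Unicode range as they were.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
}

console.log(decodeHtmlSafe("We&#39;re unable to complete your request at this time."));
// → We're unable to complete your request at this time.
```

In the fallback path, input such as `&nbsp;` comes back unchanged, so this only suits data known to contain numeric references.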

edited Mar 04 '14 at 18:55

answered Sep 12 '11 at 22:29

Rob W

2

Ah, seems like basically the same approach I took but without the jQuery dependency (which is nice). Doesn't it still seem hacky, though? Or should I be perfectly comfortable with it? – Dan Tao Sep 12 '11 at 22:33
41

Oh wait, I get it: you're using `textarea` specifically so that the tags are preserved (as you said) but HTML entities still get decoded. Pretty clever... – Dan Tao Sep 12 '11 at 22:34
3

It's acceptable. It's the best way to decode HTML. No tags are parsed, unlike your original solution, which parses (and thus hides) tags. – Rob W Sep 12 '11 at 22:34
1

Nice trick! I'd been using a non-`textarea` version of this for a while, and this is by far better. – Domenic Sep 12 '11 at 22:49
How safe is this with untrusted inputs? See [this comment.](http://stackoverflow.com/questions/1147359/how-to-decode-html-entities-using-jquery?answertab=votes#comment6018122_2419664) The jQuery version in the question would be susceptible. Does using textarea prevent unsafe code from actually being executed? – Andy Madge Mar 04 '14 at 18:40
@AndyMadge "HTML" inside a textarea is not executed. See example in revised answer. – Rob W Mar 04 '14 at 18:55
How does this work with older versions of IE (thinking 7 and 8)? A lot of our user base is still mired in these archaic browsers, and this wonderful function throws back `SCRIPT601: Unknown runtime error`. – kolin Mar 23 '15 at 15:30
@kolin You could strip the HTML tags using `html = html.replace(/<[^>]*>/g, '');` to work around that problem. This solution goes badly with input like ``, but for most usual inputs, it is sufficient. – Rob W Mar 23 '15 at 15:56
Doesn't work for me. `txt.value` is always an empty string. – Otto Abnormalverbraucher Aug 20 '15 at 13:20
Are we supposed to remove the `txt` element created by the decodeHtml function? Or is it not attached to the DOM so we don't care? – Leonardo Sep 29 '15 at 08:23
1

@Leonardo It is never attached to the document. – Rob W Sep 29 '15 at 12:56
This solution assumes that browsers decode character references correctly, [which is not true](https://github.com/mathiasbynens/he/issues/42#issuecomment-194716815) for older versions of WebKit and IE, and even for current versions of Edge. Believe it or not, Edge even decodes `&nbsp;` incorrectly, making the example in your answer fail to produce the right answer in that browser. [Don’t use the DOM to decode HTML entities.](https://stackoverflow.com/a/35915311/96656) – Mathias Bynens Mar 10 '16 at 11:38
@MathiasBynens Interesting. Do you know whether replacing all occurrences of < with `&lt;` and then wrapping the result in `` & `` produces the correctly decoded entities? Maybe `txt.innerHTML = ...` has a special treatment (because it's inside a textarea), but I'd expect the HTML parser to correctly parse HTML inside a, say, `` element. – Rob W Mar 10 '16 at 11:42
@RobW The bug is not due to `innerHTML`/`textContent` usage, but rather due to the browser engine having incorrect (per the spec) character tables. [My old test page](https://mathias.html5.org/tests/html/named-character-references/) uses a `` element. – Mathias Bynens Mar 10 '16 at 13:20
@Mathias Firefox 45 and Chrome 49 pass your tests without issues. Could you add test results + version numbers to your answer/repo, so that it's more obvious whether someone really needs `he`? `he`'s advantage is that it is deterministic across all environments, but if eventually all browsers catch up (so far you've only said "Edge", without version), then that is not relevant any more, especially since the method in my answer performs better. – Rob W Mar 11 '16 at 08:43
@RobW The first link in my first comment points to the bugs found through those tests. – Mathias Bynens Mar 11 '16 at 14:00
@MathiasBynens I don't see *relevant* test cases in your link. The [WebKit bug](https://bugs.webkit.org/show_bug.cgi?id=74826) was fixed 4 years ago. I tried to reproduce the [IE bug](https://connect.microsoft.com/IE/feedback/details/743819) you're referring to (`&nbsp;` is supposedly decoded as a space instead of a non-breaking space character), but the result is good: `&nbsp;`'s character code is 160. https://jsfiddle.net/49j0t5mt/ (looks good in IE 9, 10, 11, Edge 12, Edge 13). – Rob W Mar 11 '16 at 15:46
`$('').html("Chris' corner").text();` will this attach a `textarea` to document? – Manish Kumar Feb 03 '17 at 13:18
@manish No. `$('').html("Chris' corner").text();` will not attach a textarea to the document. – Rob W Feb 03 '17 at 13:25
`$('')` creates a new element right? – Manish Kumar Feb 03 '17 at 13:44
@manish. Yes. Please look up the documentation for jQuery. – Rob W Feb 03 '17 at 14:19
@RobW, what is the advantage of using 'document.createElement("textarea");' over 'document.createElement("div");', apart from HTML tags not being executed inside a textarea? – Mohan Krishnan Nov 26 '18 at 13:27
Is there any way to have a React JS version of this? – try_catch Jun 20 '21 at 04:20

score 177 · Answer 2 · edited Oct 22 '22 at 22:07

177

Don’t use the DOM to do this if you care about legacy compatibility. Using the DOM to decode HTML entities (as suggested in the currently accepted answer) leads to differences in cross-browser results on non-modern browsers.

For a robust & deterministic solution that decodes character references according to the algorithm in the HTML Standard, use the he library. From its README:

he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.

Here’s how you’d use it:

he.decode("We&#39;re unable to complete your request at this time.");
→ "We're unable to complete your request at this time."

Disclaimer: I'm the author of the he library.

See this Stack Overflow answer for some more info.

edited Oct 22 '22 at 22:07

Jason

answered Mar 10 '16 at 11:33

Mathias Bynens

19

I was in NodeJS so for me this was the only available solution. – Augustin Riedinger Jun 23 '17 at 13:14
I was writing a browser plugin that scraped the page for stuff, so the dom based solution was not an issue. It depends on context. – Ray Foss Jul 28 '17 at 16:28
3

How significant is "leads to differences in cross-browser results"? In which browser can the result be very different? Would you please give me an exact example (the most significant one in your mind)? I don't want to pull in an excessive third-party library, so I would like to know about it first. – Taufik Nur Rahmanda Dec 29 '17 at 03:29
1

@TaufikNurRahmanda The link it points to answers that question. – Mathias Bynens Jan 16 '18 at 10:14
1

This one should be the right answer. Works better than lodash and underscore. – Tony Wang Apr 27 '20 at 04:55
10

This library is 30KB gzipped... I don't want to get new libraries for every tiny problem I have to solve in JS. – RaisinBranCrunch Oct 14 '21 at 20:52
Also pointing out that there are no longer differences in cross-browser results. The issues referenced here are extremely outdated. This answer should be amended or marked as outdated. – Jason Oct 22 '22 at 22:05

score 45 · Answer 3 · edited Oct 13 '16 at 21:59

45

If you don't want to use html/dom, you could use regex. I haven't tested this; but something along the lines of:

function parseHtmlEntities(str) {
    return str.replace(/&#([0-9]{1,3});/gi, function(match, numStr) {
        var num = parseInt(numStr, 10); // read num as normal number
        return String.fromCharCode(num);
    });
}

[Edit]

Note: this only works for numeric HTML entities, not named ones like &aring;.

[Edit 2]

Fixed the function (some typos), test here: http://jsfiddle.net/Be2Bd/1/
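To illustrate what covering more than decimal references might involve, here is a minimal sketch that also handles hexadecimal references and a handful of named ones through a lookup table. The name `parseHtmlEntitiesExtended` and the `NAMED_ENTITIES` table are hypothetical, and the table is deliberately tiny; a complete decoder needs the full list of named character references from the HTML Standard.

```javascript
// Illustrative table only: a real decoder needs the full list of named
// character references from the HTML Standard (or a library such as he).
var NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00a0" };

function parseHtmlEntitiesExtended(str) {
  return str.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, function (match, body) {
    if (body.charAt(0) !== "#") {
      // Named reference: look it up, or leave the text unchanged if unknown.
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : match;
    }
    var isHex = body.charAt(1) === "x" || body.charAt(1) === "X";
    var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // Skip values outside the Unicode range instead of throwing.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
}

console.log(parseHtmlEntitiesExtended("We&#39;re"));               // → We're
console.log(parseHtmlEntitiesExtended("&#xE5;"));                  // → å
console.log(parseHtmlEntitiesExtended("&lt;b&gt; &amp; &#8211;")); // → <b> & –
```

Because `replace` does not rescan its own output, a string such as `&amp;#39;` decodes once to `&#39;` rather than all the way to an apostrophe.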

edited Oct 13 '16 at 21:59

cchamberlain

answered Sep 12 '11 at 22:33

Alxandr

7

What about `&amp;` and other named entities? These are still not parsed in this implementation. – Rob W Sep 12 '11 at 22:36
5

I already commented on the fact that they won't be parsed. To parse those, you'd need a hashmap of some sort (lookup). However, if this code is autogenerated (per say), then there is a chance that it always will return the numeric value. I only provided a pure-js way of doing this (works without DOM), not saying it solves the general problem, but more the specific one. – Alxandr Sep 12 '11 at 22:45
I had to server-side decode some ASP `Server.HTMLEncode`d string, and try this without a `document`. That’s a nifty solution, thank you for it! – dakab May 19 '14 at 06:41
1

Not working with &gt; &lt; – Arthur Ronconi Dec 23 '16 at 13:25
1

I've just used this JSFiddle with one slight change, `{1,3}` to `{1,4}` which also allows for more characters, such as en dashes (`&#8211;`). For future reference, anyone else have this for other tags, such as `&gt;` ? – davewoodhall Oct 15 '17 at 23:40
2

In the regex match `/(pattern)/gi` the `i` suffix to ignore case isn't needed, as this is only going to match _numbers_. Along with davewoodhall's comment, I'm using `/&#([0-9]{1,4});/g` – Stephen P Feb 28 '18 at 01:07
This is working well for some text with numeric HTML entities returned from a REST API service! I'm using it in an Angular Pipe class I made. – RcoderNY Oct 21 '20 at 10:48

score 38 · Answer 4 · edited Aug 25 '22 at 16:13

38

There's a JS function to deal with &#xxxx;-style entities:
function at GitHub

// encode(decode) html text into html entity
var decodeHtmlEntity = function(str) {
  return str.replace(/&#(\d+);/g, function(match, dec) {
    return String.fromCharCode(dec);
  });
};

var encodeHtmlEntity = function(str) {
  var buf = [];
  for (var i=str.length-1;i>=0;i--) {
    buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
  }
  return buf.join('');
};

var entity = '&#39640;&#32423;&#31243;&#24207;&#35774;&#35745;';
var str = '高级程序设计';

let element = document.getElementById("testFunct");
element.innerHTML = (decodeHtmlEntity(entity));

console.log(decodeHtmlEntity(entity) === str);
console.log(encodeHtmlEntity(str) === entity);
// output:
// true
// true

<div><span id="testFunct"></span></div>
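One caveat worth illustrating: the functions above work on UTF-16 code units (`charCodeAt`, `String.fromCharCode`), so a character outside the Basic Multilingual Plane, such as an emoji, is encoded as two surrogate references, and a reference above 65535 is not decoded to the intended character. A possible variant based on code points is sketched below; the function names are hypothetical and this is not the code from the linked GitHub source.

```javascript
// Decode decimal references by code point, so values above 0xFFFF survive.
function decodeHtmlEntityByCodePoint(str) {
  return str.replace(/&#(\d+);/g, function (match, dec) {
    var code = Number(dec);
    // Leave references outside the Unicode range as they were.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
}

// Encode every character as a decimal reference, one per code point.
function encodeHtmlEntityByCodePoint(str) {
  // Array.from iterates by code point, so a surrogate pair stays together.
  return Array.from(str, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  }).join("");
}

console.log(encodeHtmlEntityByCodePoint("高\u{1F600}"));
// → &#39640;&#128512;
console.log(decodeHtmlEntityByCodePoint("&#39640;&#128512;") === "高\u{1F600}");
// → true
```

Like the original, this still covers decimal references only; hexadecimal and named references need separate handling.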

edited Aug 25 '22 at 16:13

B Kalra

answered Apr 23 '15 at 13:13

hypers

5

Thank you. This function worked beautifully in a #gatsbyjs application where `document` could not be defined during static HTML builds. – David Gaskin Mar 13 '19 at 07:34
3

This is how it should be done! – silver Mar 09 '20 at 05:42
2

This is the way! No lib cluttering, no DOM manipulation, no html injection. – Edeph Jul 23 '21 at 11:48
This does not fix hex-encoding, like `&#xE5;` (å) – Pål Thingbø Jan 14 '23 at 22:10

Jason Williams · Answer 5 · 2018-09-12T15:38:11.420

36

jQuery will encode and decode for you.

function htmlDecode(value) {
  return $("<textarea/>").html(value).text();
}

function htmlEncode(value) {
  return $('<textarea/>').text(value).html();
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
   $("#encoded")
  .text(htmlEncode("<img src onerror='alert(0)'>"));
   $("#decoded")
  .text(htmlDecode("&lt;img src onerror='alert(0)'&gt;"));
});
</script>

<span>htmlEncode() result:</span><br/>
<div id="encoded"></div>
<br/>
<span>htmlDecode() result:</span><br/>
<div id="decoded"></div>

edited Sep 12 '18 at 15:38

answered Mar 21 '16 at 17:43

Jason Williams

1

For reference, any "element" will work here. There is nothing magic about the textarea that does the work here. But that said, if you're using jQuery already, I always employ this approach with fantastic results. – pim Mar 21 '16 at 20:54
9

I would disagree. Textarea provides security which other elements, like divs, will not. If you use a div instead of a textarea, any non-encoded javascript in the input will be rendered in the browser. A textarea gets around this by treating the input as text... not as html. I haven't tried other elements to know how they behave. – Jason Williams Mar 23 '16 at 12:53
And just to further clarify: if you DO want the html to render in the browser after conversion, wrap it in an element that is not a text input. – Spartacus Jul 26 '17 at 15:33

tldr · Answer 6 · 2018-12-15T16:22:10.277

18

_.unescape does what you're looking for

https://lodash.com/docs/#unescape
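To give a sense of the scope before adding the dependency: according to the current lodash documentation, `_.unescape` converts only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&#39;`. The sketch below is a hand-written approximation of that documented behaviour, not lodash's actual source; `unescapeBasic` and `BASIC_ENTITIES` are hypothetical names.

```javascript
// Approximation of the five entities that lodash documents for _.unescape.
var BASIC_ENTITIES = { "&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'" };

function unescapeBasic(str) {
  return str.replace(/&(?:amp|lt|gt|quot|#39);/g, function (entity) {
    return BASIC_ENTITIES[entity];
  });
}

console.log(unescapeBasic("We&#39;re &lt;b&gt;")); // → We're <b>
console.log(unescapeBasic("&nbsp;&#8211;"));       // → &nbsp;&#8211; (unchanged)
```

Anything outside those five entities is passed through untouched, which is fine for the apostrophe in the question but not for general HTML entity decoding.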

edited Dec 15 '18 at 16:22

answered Apr 03 '17 at 22:06

tldr

3

it just replaces a few encoded characters - if you've got e.g. a it stays the way it is. – xtools Jan 16 '18 at 10:41
e' is not in the list? This only replaces &, <, >, ", ` and ' – Jquestions Dec 14 '18 at 11:25
updated link to lodash.unescape, which handles ' – tldr Dec 15 '18 at 16:22

score 0 · Answer 7 · answered Dec 22 '16 at 13:58

0

This is such a good answer. You can use it with Angular like this:

 moduleDefinitions.filter('sanitize', ['$sce', function($sce) {
    return function(htmlCode) {
        var txt = document.createElement("textarea");
        txt.innerHTML = htmlCode;
        return $sce.trustAsHtml(txt.value);
    }
}]);

answered Dec 22 '16 at 13:58

kodmanyagha

2

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Nic3500 Aug 29 '18 at 16:59
