290

I have some JavaScript code that communicates with an XML-RPC backend. The XML-RPC returns strings of the form:

<img src='myimage.jpg'>

However, when I use the JavaScript to insert the strings into HTML, they render literally. I don't see an image, I literally see the string:

<img src='myimage.jpg'>

My guess is that the HTML is being escaped over the XML-RPC channel.

How can I unescape the string in JavaScript? I tried the techniques on this page, unsuccessfully: http://paulschreiber.com/blog/2008/09/20/javascript-how-to-unescape-html-entities/

What are other ways to diagnose the issue?

Mark Amery
Joseph Turian
  • 2
    As strings containing HTML entities are something different than [`escape`](https://developer.mozilla.org/en/DOM/window.escape)d or [URI encoded strings](https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/encodeURIComponent), those functions won't work. – Marcel Korpel Sep 13 '10 at 13:15
  • The huge function included in this article seems to work fine: http://blogs.msdn.com/b/aoakley/archive/2003/11/12/49645.aspx I don't think that's the most clever solution but works. – mati Sep 13 '10 at 12:52
  • 2
    @Matias note that new named entities have been added to HTML (e.g. via the HTML 5 spec) since that function was authored in 2003 - for instance, it doesn't recognise `&zopf;`. This is a problem with an evolving spec; as such, you should pick a tool that's actually being maintained to solve it with. – Mark Amery Feb 19 '17 at 15:03
  • Possible duplicate of [How to decode HTML entities using jQuery?](https://stackoverflow.com/questions/1147359/how-to-decode-html-entities-using-jquery) – lucascaro Nov 13 '18 at 19:23
  • I've just realized how easy it is to confuse this question with encoding HTML entities. I've just realized I accidentally posted an answer for the wrong question on this question! I've deleted it, though. – shreyasm-dev Sep 25 '20 at 16:59

34 Answers

669

Most answers given here have a huge disadvantage: if the string you are trying to convert isn't trusted then you will end up with a Cross-Site Scripting (XSS) vulnerability. For the function in the accepted answer, consider the following:

htmlDecode("<img src='dummy' onerror='alert(/xss/)'>");

The string here contains an unescaped HTML tag, so instead of decoding anything the htmlDecode function will actually run JavaScript code specified inside the string.

This can be avoided by using DOMParser which is supported in all modern browsers:

function htmlDecode(input) {
  var doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

console.log(  htmlDecode("&lt;img src='myimage.jpg'&gt;")  )    
// "<img src='myimage.jpg'>"

console.log(  htmlDecode("<img src='dummy' onerror='alert(/xss/)'>")  )  
// ""

This function is guaranteed not to run any JavaScript code as a side effect. Any HTML tags will be ignored; only text content will be returned.

Compatibility note: Parsing HTML with DOMParser requires at least Chrome 30, Firefox 12, Opera 17, Internet Explorer 10, Safari 7.1 or Microsoft Edge. So all browsers without support are way past their EOL and as of 2017 the only ones that can still be seen in the wild occasionally are older Internet Explorer and Safari versions (usually these still aren't numerous enough to bother).
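If those older browsers still matter to you, a feature test is one option. The following is only a minimal sketch: the helper name supportsHtmlDomParser is made up for illustration, and it assumes that an unsupported implementation either throws or returns null for "text/html", which is how some older browsers reportedly behaved.

```javascript
// Hypothetical helper (not part of any library): reports whether
// DOMParser exists and can parse "text/html" in this environment.
function supportsHtmlDomParser() {
  if (typeof DOMParser === 'undefined') {
    return false; // e.g. Node.js or a very old browser
  }
  try {
    // Assumption: unsupported implementations throw or return null here.
    var doc = new DOMParser().parseFromString('<p>x</p>', 'text/html');
    return Boolean(doc && doc.documentElement);
  } catch (e) {
    return false;
  }
}

console.log(supportsHtmlDomParser());
```

When the test returns false, falling back to an innerHTML-based approach would bring back the XSS problem described above, so a library that decodes entities without touching the DOM is probably the safer fallback.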

vsync
Wladimir Palant
  • 38
    I think this answer is the best because it mentioned the XSS vulnerability. – Константин Ван Dec 30 '15 at 18:04
  • 2
    Note that (according to your reference) `DOMParser` did not support `"text/html"` before Firefox 12.0, and [there are still some latest versions of browsers that do not even support `DOMParser.prototype.parseFromString()`](http://caniuse.com/#search=domparser). According to your reference, `DOMParser` is still an experimental technology, and the stand-ins use the `innerHTML` property which, as you also pointed out in response to [my approach](http://stackoverflow.com/a/12585218/855543), has this XSS vulnerability (which ought to be fixed by browser vendors). – PointedEars Feb 28 '16 at 08:53
  • 5
    @PointedEars: Who cares about Firefox 12 in 2016? The problematic ones are Internet Explorer up to 9.0 and Safari up to 7.0. If one can afford not supporting them (which will hopefully be everybody soon) then DOMParser is the best choice. If not - yes, processing entities only would be an option. – Wladimir Palant Feb 28 '16 at 12:43
  • 1. Please read my entire comment. 2. You do not have to use either one or the other, you can do feature tests. 3. That does not change the fact that if `DOMParser` is not available, it does not suffice to process “only entities”. – PointedEars Feb 28 '16 at 13:12
  • @PointedEars browser vendors cannot "fix" innerHTML, because it's working exactly as expected: you give it some HTML and the browser renders it. The problem, as they say, is between keyboard and chair: namely giving it pieces of HTML that don't come from the same website or from another trusted source. – Tobia Aug 04 '16 at 10:39
  • @Tobia You’re wrong. If “`script` elements inserted using innerHTML do not execute when they are inserted” (see reference), then at least certain event-handler values should not, too. – PointedEars Aug 07 '16 at 12:24
  • 4
    @PointedEars: ` – Wladimir Palant Aug 07 '16 at 14:48
  • Since your code is most likely to reuse this many times, avoid using "new" with new DOMParser(). Just create it once and reference a member instance. – Johann Jul 27 '17 at 05:45
  • @AndroidDev: Premature optimization is the root of all evil. I don't want to make assumptions about whether and how this code will be used, and I don't mean to encourage cargo cult programming either. – Wladimir Palant Jul 29 '17 at 14:45
  • do you have any idea about why the example `newElement.innerHTML = "";` does not work? I tried it in several browsers, including IE6, tried to add an invocation of `body.appendChild(newElement)`, but still was not able to see the alert. – d.k Nov 14 '17 at 10:41
  • @user907860: What is `newElement`? If it is something like ` – Wladimir Palant Nov 14 '17 at 12:57
  • the `newElement` is the newly created div, from the accepted answer, where it was named `e` : `var e = document.createElement('div');`. No, no "about:blank", I'm using a regular webpage on a local server. Actually, thank you for the response, by now it's enough for me, since if it is something unexpected, I'll probably ask a dedicated question a bit later – d.k Nov 14 '17 at 14:06
  • This is the solution that does work when evaluating a string within an SVG document that is encoded, in IE11. The
    solution does NOT work, as no child nodes are created when the inner text is set. So I'm for this solution as it works more broadly. The solution needs to work without any outside frameworks - natively with what the browsers provide; this fits the bill.
    – Minok May 02 '18 at 00:01
  • Thanks for sharing this, but using this answer didn't help convert an escaped SVG string. Would you mind taking a peek? Thanks so much: https://stackoverflow.com/questions/54003323/svg-converting-escaped-svg-string-and-appending-to-body-does-nothing – Crashalot Jan 02 '19 at 08:35
  • @Crashalot: `DOMParser` can parse XML code as well, you merely need to change the MIME type. Somebody already pointed that out to you. – Wladimir Palant Jan 02 '19 at 09:37
  • This code is **extremely** slow! See [my answer](https://stackoverflow.com/a/55142351/5286034) where I provided proofs. – Илья Зеленько Mar 13 '19 at 12:52
  • 2
    @ИльяЗеленько: Do you plan to use this code in a tight loop or why does the performance matter? Your answer is again vulnerable to XSS, was it really worth it? – Wladimir Palant Mar 13 '19 at 19:39
  • Thank you Wladimir Palant! I've been looking for this, appreciate the example & explanation. – user752746 Apr 10 '19 at 22:33
  • Worked for me :) – Naveen Kumar V Sep 18 '20 at 16:22
  • Note: This answer also removes HTML tags. In case you want to _decode entities only_ and keep the tags you can use `return doc.body.innerHTML` (instead of `return doc.documentElement.textContent`). – Peter T. Aug 03 '21 at 06:53
315

Do you need to decode all encoded HTML entities or just &amp; itself?

If you only need to handle &amp; then you can do this:

var decoded = encoded.replace(/&amp;/g, '&');
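If you later extend this to a handful of other entities, note that chaining several replace calls can decode the same text twice (for example, `&amp;lt;` ends up as `<` when `&amp;` is handled first). Here is a minimal sketch that avoids this by decoding in a single pass; the function name decodeBasicEntities and the five-entity table are assumptions chosen for illustration, not a complete list of HTML entities.

```javascript
// Hypothetical lookup table covering only five common entities.
var BASIC_ENTITIES = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'"
};

// Decode in one pass so that already-decoded output is never re-scanned.
function decodeBasicEntities(encoded) {
  return String(encoded).replace(/&(?:amp|lt|gt|quot|#39);/g, function (match) {
    return BASIC_ENTITIES[match];
  });
}

console.log(decodeBasicEntities('fish &amp; chips')); // "fish & chips"
console.log(decodeBasicEntities('&amp;lt;'));         // "&lt;" (decoded once, not twice)
```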

If you need to decode all HTML entities then you can do it without jQuery:

var elem = document.createElement('textarea');
elem.innerHTML = encoded;
var decoded = elem.value;

Please take note of Mark's comments below which highlight security holes in an earlier version of this answer and recommend using textarea rather than div to mitigate against potential XSS vulnerabilities. These vulnerabilities exist whether you use jQuery or plain JavaScript.

Mark Amery
LukeH
  • 20
    Beware! This is potentially insecure. If `encoded=''` then the snippet above will show an alert. This means if your encoded text is coming from user input, decoding it with this snippet may present an XSS vulnerability. – Mark Amery Jul 10 '15 at 20:39
  • @MarkAmery I'm not a security expert, but it looks like if you immediately set the div to `null` after getting the text, the alert in the img isn't fired - http://jsfiddle.net/Mottie/gaBeb/128/ – Mottie Jul 17 '15 at 16:53
  • 4
    @Mottie not sure which browser that worked for you in, but the `alert(1)` still fires for me on Chrome on OS X. If you want a safe variant of this hack, try [using a `textarea`](http://stackoverflow.com/a/31350391/1709587). – Mark Amery Jul 17 '15 at 16:58
  • +1 for the simple regexp replace alternative for just one kind of html entity. Do use this if you are expecting html data being interpolated from, say, a python flask app to a template. – OzzyTheGiant Mar 01 '17 at 21:18
  • 2
    How to do this on Node server? – Mohammad Kermani Jun 27 '18 at 10:51
  • @MohammadKermani: [`he`, `entities` and `html-entities`](https://github.com/mathiasbynens/he/issues/64#issuecomment-652124730), but this question is a duplicate of https://stackoverflow.com/questions/1912501/unescape-html-entities-in-javascript – Dan Dascalescu Jul 01 '20 at 01:15
  • Please note that using a `<textarea>` is still problematic. – Wladimir Palant Oct 22 '21 at 18:07
  • This fails on Firefox if there is an inline style with the `font-family` set, because the font's name is put in quotation marks, which are escaped, so the resulting string will look like this: `style="font-family: "Roboto";"` – Waruyama Apr 18 '23 at 06:30
200

EDIT: You should use the DOMParser API as Wladimir suggests. I edited my previous answer because the function it originally posted introduced a security vulnerability.

The following snippet is the old answer's code with a small modification: using a textarea instead of a div reduces the XSS vulnerability, but it is still problematic in IE9 and Firefox.

function htmlDecode(input){
  var e = document.createElement('textarea');
  e.innerHTML = input;
  // handle case of empty input
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

htmlDecode("&lt;img src='myimage.jpg'&gt;"); 
// returns "<img src='myimage.jpg'>"

Basically I create a DOM element programmatically, assign the encoded HTML to its innerHTML and retrieve the nodeValue from the text node created on the innerHTML insertion. Since it just creates an element but never adds it, no site HTML is modified.

It will work cross-browser (including older browsers) and accept all the HTML Character Entities.

EDIT: The old version of this code did not work on IE with blank inputs, as evidenced here on jsFiddle (view in IE). The version above works with all inputs.

UPDATE: It appears this doesn't work with large strings, and it also introduces a security vulnerability; see the comments.

Wladimir Palant
Christian C. Salvadó
  • Got it, you changed to ', so let me delete my comment back, thx, its working great, +1 – YOU Dec 16 '09 at 05:41
  • 1
    @S.Mark: `&apos;` doesn't belong to the HTML 4 entities, that's why! http://www.w3.org/TR/html4/sgml/entities.html http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/ – Christian C. Salvadó Dec 16 '09 at 05:48
  • 2
    See also @kender's note about the poor security of this approach. – Joseph Turian Dec 16 '09 at 20:52
  • 2
    See my note to @kender about the poor testing he did ;) – Roatin Marth Dec 16 '09 at 21:08
  • See this related post in SO: http://stackoverflow.com/questions/1090056/how-to-unescape-html-in-javascript/1090461#1090461 ... looks like using innerHTML is not the way to go for security reasons. – Tom Auger Dec 22 '10 at 23:13
  • @CMS how do I do the opposite of this? – Adam Lynch Jun 07 '11 at 15:51
  • @CMS Nevermind. Sorted my problem by encoding to HTML entities in PHP and then using your function in JS to decode – Adam Lynch Jun 08 '11 at 14:51
  • Some jsperf tests: http://jsperf.com/decodehtmlclone if you are decoding strings in a loop you might consider creating only once the "div" outside of the decode function. – corbacho Feb 07 '13 at 10:39
  • Regarding @TomAuger security comment - the above code does not add the div to DOM, so nothing is rendered. It safely converts un-escapes HTML elements. – Blazes Jul 24 '13 at 15:56
  • This actually doesn't work for very long strings, above 65536 chars in Chrome v39. Then Chrome splits the contents into many `e.childNodes[*]`, so one needs to iterate over them. I added an answer that does that, see: http://stackoverflow.com/a/27546437/694469 – KajMagnus Dec 18 '14 at 12:31
  • I think this will cause a slow memory leak. You probably want to store the result, remove the element you create, and then return the stored result. – IAmNaN Mar 30 '15 at 18:19
  • ok, I know that SO doesn't like "thanks!" and "me too!", but you saved my day: your code also works for reading javascript inside a
     element and evaluating by inserting  a script element with the javascript as code. Which is what I spent hours trying to do...
    – user1251840 Aug 06 '15 at 16:19
  • 31
    This function is a security hazard, JavaScript code will run even despite the element not being added to the DOM. So this is only something to use if the input string is trusted. I added [my own answer](https://stackoverflow.com/a/34064434/785541) explaining the issue and providing a secure solution. As a side-effect, the result isn't being cut off if multiple text nodes exist. – Wladimir Palant Dec 03 '15 at 11:13
  • This does not work if the string is already unescaped. In my use case, sometimes the string is escaped, and sometimes it is not. So I would want a method that can take any string and decode it. Is that possible? – Kousha Jul 05 '16 at 03:59
  • For angular, you can wrap this in a filter, like in http://stackoverflow.com/questions/31412551/html-encoded-string-not-translating-correctly-in-angularjs/38769171#38769171. – Urb Gim Tam Aug 04 '16 at 13:49
  • @CMS would it be possible for you to either update the answer so that it does not propose unsafe code, or delete it so that the next answer becomes the top answer? I hope I'm not sounding rude -- I'd like to minimize the risk that potentially risky code gets copy-pasted by someone without reading the fine print. I find it scary to link to this thread as it is now. – Kos Feb 25 '18 at 14:34
  • 1
    This doesn't work if JS is not running in the browser, i.e. with Node. – Mattia Rasulo Mar 31 '21 at 08:34
110

A more modern option for interpreting HTML (text and otherwise) from JavaScript is the HTML support in the DOMParser API (see here in MDN). This allows you to use the browser's native HTML parser to convert a string to an HTML document. It has been supported in new versions of all major browsers since late 2014.

If we just want to decode some text content, we can put it as the sole content in a document body, parse the document, and pull out its .body.textContent.

var encodedStr = 'hello &amp; world';

var parser = new DOMParser;
var dom = parser.parseFromString(
    '<!doctype html><body>' + encodedStr,
    'text/html');
var decodedString = dom.body.textContent;

console.log(decodedString);

We can see in the draft specification for DOMParser that JavaScript is not enabled for the parsed document, so we can perform this text conversion without security concerns.

The parseFromString(str, type) method must run these steps, depending on type:

  • "text/html"

    Parse str with an HTML parser, and return the newly created Document.

    The scripting flag must be set to "disabled".

    NOTE

    script elements get marked unexecutable and the contents of noscript get parsed as markup.

It's beyond the scope of this question, but please note that if you're taking the parsed DOM nodes themselves (not just their text content) and moving them to the live document DOM, it's possible that their scripting would be reenabled, and there could be security concerns. I haven't researched it, so please exercise caution.

Mark Amery
Jeremy
55

Matthias Bynens has a library for this: https://github.com/mathiasbynens/he

Example:

console.log(
    he.decode("J&#246;rg &amp J&#xFC;rgen rocked to &amp; fro ")
);
// Logs "Jörg & Jürgen rocked to & fro"

I suggest favouring it over hacks involving setting an element's HTML content and then reading back its text content. Such approaches can work, but are deceptively dangerous and present XSS opportunities if used on untrusted user input.

If you really can't bear to load in a library, you can use the textarea hack described in this answer to a near-duplicate question, which, unlike various similar approaches that have been suggested, has no security holes that I know of:

function decodeEntities(encodedString) {
    var textArea = document.createElement('textarea');
    textArea.innerHTML = encodedString;
    return textArea.value;
}

console.log(decodeEntities('1 &amp; 2')); // '1 & 2'

But take note of the security issues, affecting similar approaches to this one, that I list in the linked answer! This approach is a hack, and future changes to the permissible content of a textarea (or bugs in particular browsers) could lead to code that relies upon it suddenly having an XSS hole one day.

Mark Amery
  • 1
    Matthias Bynens' library `he` is absolutely great! Thank you very much for the recommendation! – Pedro A Feb 02 '18 at 01:16
40

If you're using jQuery:

function htmlDecode(value){ 
  return $('<div/>').html(value).text(); 
}

Otherwise, use Strictly Software's Encoder Object, which has an excellent htmlDecode() function.

Chris Fulstow
28

You can use Lodash's unescape / escape functions: https://lodash.com/docs/4.17.5#unescape

import unescape from 'lodash/unescape';

const str = unescape('fred, barney, &amp; pebbles');

str will become 'fred, barney, & pebbles'

I am L
  • 2
    probably better to do "import _unescape from 'lodash/unescape';" so it doesn't conflict with the deprecated javascript function of the same name: unescape – Rick Penabella Oct 20 '19 at 12:16
  • The best answer. We already have lodash in our project and it also escapes more correctly than he. – Eugene Barsky Mar 17 '23 at 14:16
  • Lodash only unescapes five entities ('&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'"), and [warns its users](https://github.com/lodash/lodash/blob/2da024c3b4f9947a48517639de7560457cd4ec6c/unescape.js#L32), "_**Note:** No other HTML entities are unescaped. To unescape additional HTML entities use a third-party library like [_he_](https://mths.be/he)._" fwiw, 2¢, etc. – ruffin Aug 03 '23 at 22:08
23
var htmlEnDeCode = (function() {
    var charToEntityRegex,
        entityToCharRegex,
        charToEntity,
        entityToChar;

    function resetCharacterEntities() {
        charToEntity = {};
        entityToChar = {};
        // add the default set
        addCharacterEntities({
            '&amp;'     :   '&',
            '&gt;'      :   '>',
            '&lt;'      :   '<',
            '&quot;'    :   '"',
            '&#39;'     :   "'"
        });
    }

    function addCharacterEntities(newEntities) {
        var charKeys = [],
            entityKeys = [],
            key, echar;
        for (key in newEntities) {
            echar = newEntities[key];
            entityToChar[key] = echar;
            charToEntity[echar] = key;
            charKeys.push(echar);
            entityKeys.push(key);
        }
        charToEntityRegex = new RegExp('(' + charKeys.join('|') + ')', 'g');
        entityToCharRegex = new RegExp('(' + entityKeys.join('|') + '|&#[0-9]{1,5};' + ')', 'g');
    }

    function htmlEncode(value){
        var htmlEncodeReplaceFn = function(match, capture) {
            return charToEntity[capture];
        };

        return (!value) ? value : String(value).replace(charToEntityRegex, htmlEncodeReplaceFn);
    }

    function htmlDecode(value) {
        var htmlDecodeReplaceFn = function(match, capture) {
            return (capture in entityToChar) ? entityToChar[capture] : String.fromCharCode(parseInt(capture.substr(2), 10));
        };

        return (!value) ? value : String(value).replace(entityToCharRegex, htmlDecodeReplaceFn);
    }

    resetCharacterEntities();

    return {
        htmlEncode: htmlEncode,
        htmlDecode: htmlDecode
    };
})();

This is from ExtJS source code.

WaiKit Kung
  • 4
    -1; this fails to handle the vast majority of named entities. For instance, `htmlEnDecode.htmlDecode('&euro;')` should return `'€'`, but instead returns `'&euro;'`. – Mark Amery Feb 19 '17 at 15:22
19

The trick is to use the power of the browser to decode the special HTML characters, but not allow the browser to execute the results as if it was actual html... This function uses a regex to identify and replace encoded HTML characters, one character at a time.

function unescapeHtml(html) {
    var el = document.createElement('div');
    return html.replace(/\&[#0-9a-z]+;/gi, function (enc) {
        el.innerHTML = enc;
        return el.innerText
    });
}
Ben White
17

element.innerText also does the trick.

laggingreflex
avg_joe
14

In case you're looking for it, like me: there is now a nice and safe jQuery method.

https://api.jquery.com/jquery.parsehtml/

For example, you can type this in your console:

var x = "test &amp;";
> undefined
$.parseHTML(x)[0].textContent
> "test &"

So $.parseHTML(x) returns an array, and if you have HTML markup within your text, the array.length will be greater than 1.

cslotty
  • Worked perfectly for me, this was exactly what i was looking for, thank you. – Jonathan Nielsen Jul 17 '19 at 06:40
  • 1
    If `x` has a value of `<script>alert('hello');</script>` the above will crash. In current jQuery it won't actually try to run the script, but `[0]` will yield `undefined` so the call to `textContent` will fail and your script will stop there. `$('<div />').html(x).text();` looks safer - via https://gist.github.com/jmblog/3222899 – Andrew Hodgkinson Aug 13 '19 at 01:10
  • @AndrewHodgkinson yeah, but the question was "Decode &amp; back to & in JavaScript" - so you'd test the contents of x first or make sure you only use it in the correct cases. – cslotty Dec 05 '19 at 07:24
  • I don't really see how that follows. The code above works in all cases. And just how exactly would you "make sure" the value of x needed fixing? And what if the script example above alerted '&' so that it really did need correction? We have no idea where the OP's strings come from, so malicious input must be considered. – Andrew Hodgkinson Dec 06 '19 at 21:16
  • @AndrewHodgkinson I like your consideration, but that's not the question here. Feel free to answer that question, though. I guess you could remove script tags, f.ex. – cslotty Dec 07 '19 at 14:21
  • @AndrewHodgkinson your solution works flawlessly! you should consider answering this question with it. Nice, clear, short and efficient. +1 – Sergio A. Mar 11 '20 at 07:40
  • @SergioA. Thanks, and, done: https://stackoverflow.com/a/60645505 – Andrew Hodgkinson Mar 11 '20 at 23:04
10

jQuery will encode and decode for you. However, you need to use a textarea tag, not a div.

var str1 = 'One & two & three';
var str2 = "One &amp; two &amp; three";
  
$(document).ready(function() {
   $("#encoded").text(htmlEncode(str1)); 
   $("#decoded").text(htmlDecode(str2));
});

function htmlDecode(value) {
  return $("<textarea/>").html(value).text();
}

function htmlEncode(value) {
  return $('<textarea/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>

<div id="encoded"></div>
<div id="decoded"></div>
Jason Williams
  • 2
    -1 because there's a (surprising) security hole here for old jQuery versions, some of which probably still have a significant user base - those versions will [*detect and explicitly evaluate scripts*](https://github.com/jquery/jquery/blob/1.7/jquery.js#L6049) in the HTML passed to `.html()`. Thus even using a `textarea` isn't enough to ensure security here; I suggest [not using jQuery for this task and writing equivalent code with the plain DOM API](http://stackoverflow.com/a/1395954/1709587). (Yes, that old behaviour by jQuery is mad and awful.) – Mark Amery Feb 19 '17 at 15:30
  • Thank you for pointing that out. However, the question does not include a requirement to check for script injection. The question specifically asks about html rendered by the web server. Html content saved to a web server should probably be validated for script injection before save. – Jason Williams Feb 22 '17 at 21:07
  • I used your example and made the vanilla version (down the page) – Luis Lobo Nov 18 '22 at 15:07
6

CMS's answer works fine unless the HTML you want to unescape is very long (longer than 65536 characters). In that case, Chrome splits the inner HTML into many child nodes, each at most 65536 characters long, and you need to concatenate them. This function also works for very long strings:

function unencodeHtmlContent(escapedHtml) {
  var elem = document.createElement('div');
  elem.innerHTML = escapedHtml;
  var result = '';
  // Chrome splits innerHTML into many child nodes, each one at most 65536.
  // Whereas FF creates just one single huge child node.
  for (var i = 0; i < elem.childNodes.length; ++i) {
    result = result + elem.childNodes[i].nodeValue;
  }
  return result;
}

See this answer about innerHTML max length for more info: https://stackoverflow.com/a/27545633/694469

KajMagnus
5

To unescape HTML entities* in JavaScript you can use the small library html-escaper: npm install html-escaper

import {unescape} from 'html-escaper';

unescape('escaped string');

Or use the unescape function from Lodash or Underscore, if you are already using one of them.


*) Please note that these functions don't cover all HTML entities, but only the most common ones, i.e. &, <, >, ', ". To unescape all HTML entities you can use the he library.

Łukasz K
4

First create a <span id="decodeIt" style="display:none;"></span> somewhere in the body

Next, assign the string to be decoded as innerHTML to this:

document.getElementById("decodeIt").innerHTML=stringtodecode

Finally,

stringtodecode=document.getElementById("decodeIt").innerText

Here is the overall code:

var stringtodecode="<B>Hello</B> world<br>";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
Chris
Infoglaze.com
  • 2
    -1; this is dangerously insecure to use on untrusted input. For instance, consider what happens if `stringtodecode` contains something like ``. – Mark Amery Feb 19 '17 at 15:25
3

The question doesn't specify the origin of x but it makes sense to defend, if we can, against malicious (or just unexpected, from our own application) input. For example, suppose x has a value of &amp; <script>alert('hello');</script>. A safe and simple way to handle this in jQuery is:

var x    = "&amp; <script>alert('hello');</script>";
var safe = $('<div />').html(x).text();

// => "& alert('hello');"

Found via https://gist.github.com/jmblog/3222899. I can't see many reasons to avoid using this solution given it is at least as short, if not shorter than some alternatives and provides defence against XSS.

(I originally posted this as a comment, but am adding it as an answer since a subsequent comment in the same thread requested that I do so).

Andrew Hodgkinson
2

Not a direct response to your question, but wouldn't it be better for your RPC to return some structure (be it XML or JSON or whatever) with those image data (urls in your example) inside that structure?

Then you could just parse it in your javascript and build the <img> using javascript itself.

The structure you receive from RPC could look like:

{"img" : ["myimage.jpg", "myimage2.jpg"]}

I think it's better this way, as injecting code that comes from an external source into your page doesn't look very secure. Imagine someone hijacking your XML-RPC script and putting something you wouldn't want in there (even some JavaScript...)
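Building the elements from such a structure might look like the sketch below. The function name buildImages and the doc parameter are assumptions made for illustration, and the payload shape is simply the example structure shown above; adapt both to whatever your RPC really returns.

```javascript
// Hypothetical helper: turns '{"img": [...]}' into <img> elements.
// `doc` defaults to the browser's document; it is a parameter only so the
// function can be exercised outside a browser with a stand-in object.
function buildImages(jsonText, doc) {
  doc = doc || document;
  var payload = JSON.parse(jsonText);
  var urls = Array.isArray(payload.img) ? payload.img : [];
  return urls.map(function (url) {
    var img = doc.createElement('img');
    // The value is assigned as a URL; it is never parsed as HTML markup.
    img.src = String(url);
    return img;
  });
}

// In a browser:
//   buildImages('{"img": ["myimage.jpg"]}').forEach(function (img) {
//     document.body.appendChild(img);
//   });
```

Because the server only ever supplies URLs and the page creates the elements itself, there is no HTML string to unescape in the first place.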

kender
  • Does the @CMS approach above have this security flaw? – Joseph Turian Dec 16 '09 at 06:30
  • I just checked the following argument passed to htmlDecode fuction: htmlDecode("<img src='myimage.jpg'><script>document.write('xxxxx');</script>") and it creates the element that can be bad, imho. And I still think returning a structure instead of text to be inserted is better, you can handle errors nicely for example. – kender Dec 16 '09 at 07:06
  • 1
    I just tried `htmlDecode("<img src='myimage.jpg'><script>alert('xxxxx');</script>")` and nothing happened. I got the decoded html string back as expected. – Roatin Marth Dec 16 '09 at 21:05
2

For one-line guys:

const htmlDecode = innerHTML => Object.assign(document.createElement('textarea'), {innerHTML}).value;

console.log(htmlDecode('Complicated - Dimitri Vegas &amp; Like Mike'));
ninhjs.dev
2

You're welcome...just a messenger...full credit goes to ourcodeworld.com, link below.

window.htmlentities = {
        /**
         * Converts a string to its html characters completely.
         *
         * @param {String} str String with unescaped HTML characters
         **/
        encode : function(str) {
            var buf = [];

            for (var i=str.length-1;i>=0;i--) {
                buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
            }

            return buf.join('');
        },
        /**
         * Converts an html characterSet into its original character.
         *
         * @param {String} str htmlSet entities
         **/
        decode : function(str) {
            return str.replace(/&#(\d+);/g, function(match, dec) {
                return String.fromCharCode(dec);
            });
        }
    };

Full Credit: https://ourcodeworld.com/articles/read/188/encode-and-decode-html-entities-using-pure-javascript
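The decode function above only matches decimal references such as `&#246;`. A sketch that extends the same idea to hexadecimal references is shown below; the name decodeNumericReferences is hypothetical. It uses String.fromCodePoint so that code points above U+FFFF are handled, but it still ignores named entities and does not apply the special cases that the HTML specification defines for certain code points.

```javascript
// Hypothetical helper: decodes decimal (&#246;) and hexadecimal (&#xFC;)
// numeric character references. Named entities are left untouched.
function decodeNumericReferences(str) {
  return str.replace(/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/g, function (match, hex, dec) {
    var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the reference as-is if it is outside the Unicode range,
    // since String.fromCodePoint would throw a RangeError.
    if (codePoint > 0x10FFFF) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericReferences('J&#246;rg &#xFC;ber &#x1F600;')); // "Jörg über 😀"
```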

buycanna.io
  • 1
    This is an incomplete solution; it only handles decimal numeric character references, not named character references or hexadecimal numeric character reference. – Mark Amery Dec 06 '21 at 20:30
2

I know there are a lot of good answers here, but since I have implemented a slightly different approach, I thought I would share it.

This approach should be safe security-wise, because the escaping is handled by the browser rather than by the function itself. So if a new vulnerability is discovered in the future, the browser's fix will cover this solution as well.

const decodeHTMLEntities = text => {
    // Create a new element or use one from cache, to save some element creation overhead
    const el = decodeHTMLEntities.__cache_data_element 
             = decodeHTMLEntities.__cache_data_element 
               || document.createElement('div');
    
    const enc = text
        // Prevent any mixup of existing pattern in text
        .replace(/⪪/g, '⪪#')
        // Encode entities in special format. This will prevent native element encoder to replace any amp characters
        .replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');

    // Encode any HTML tags in the text to prevent script injection
    el.textContent = enc;

    // Decode entities from special format, back to their original HTML entities format
    el.innerHTML = el.innerHTML
        .replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
        // Undo the '⪪' -> '⪪#' escaping applied to literal ⪪ characters above
        .replace(/⪪#/g, '⪪');

    // Get the decoded HTML entities
    const dec = el.textContent;

    // Clear the element content to free memory (the text may be pretty big)
    el.textContent = '';

    return dec;
}

// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;&#x02233;&#8755;⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪#x02233⪫');</script>

By the way, I have chosen to use the characters ⪪ and ⪫ because they are rarely used, so the chance of impacting performance by matching them is significantly lower.

Slavik Meltser
1

Chris's answer is nice and elegant, but it fails if value is undefined. A simple improvement makes it solid:

function htmlDecode(value) {
   return (typeof value === 'undefined') ? '' : $('<div/>').html(value).text();
}
nerijus
  • 1
    If you are going to improve it, then do: `return (typeof value !== 'string') ? '' : $('<div/>').html(value).text();`
    – SynCap Jun 26 '17 at 08:09
1

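A JavaScript solution that catches the common ones:

var map = {amp: '&', lt: '<', gt: '>', quot: '"', '#039': "'"}
// Entities missing from the map are left unchanged instead of becoming "undefined"
str = str.replace(/&([^;]+);/g, (m, c) =>
  Object.prototype.hasOwnProperty.call(map, c) ? map[c] : m)

This is the reverse of https://stackoverflow.com/a/4835406/2738039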

Community
HK JR
1

I tried everything to remove &amp; from a JSON array. None of the above examples worked for me, but https://stackoverflow.com/users/2030321/chris gave a great solution that led me to fix my problem.

var stringtodecode="<B>Hello</B> world<br>";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText

I did not use it as-is, because I did not understand how to insert it into a modal window that was pulling JSON data into an array, but I did try this, based on the example, and it worked:

var modal = document.getElementById('demodal');
$('#ampersandcontent').text(replaceAll(data[0],"&amp;", "&"));

I like it because it is simple and it works, but I'm not sure why it's not widely used. I searched high and low to find a simple solution. I am still trying to understand the syntax and whether there is any risk in using this. I have not found anything yet.

TheLethalCoder
Digexart
  • Your first proposal is a bit tricky, but it works nicely without much effort. The second one, on the other hand, uses only brute force to decode characters; this means it could take a LOT of effort and time to build a full decoding function. That's why no one is using that approach to solve OP's problem. – Sergio A. Mar 11 '20 at 07:32
0

I was crazy enough to go through and make this function that should be pretty, if not completely, exhaustive:

function removeEncoding(string) {
    return string.replace(/&Agrave;/g, "À").replace(/&Aacute;/g, "Á").replace(/&Acirc;/g, "Â").replace(/&Atilde;/g, "Ã").replace(/&Auml;/g, "Ä").replace(/&Aring;/g, "Å").replace(/&agrave;/g, "à").replace(/&acirc;/g, "â").replace(/&atilde;/g, "ã").replace(/&auml;/g, "ä").replace(/&aring;/g, "å").replace(/&AElig;/g, "Æ").replace(/&aelig;/g, "æ").replace(/&szlig;/g, "ß").replace(/&Ccedil;/g, "Ç").replace(/&ccedil;/g, "ç").replace(/&Egrave;/g, "È").replace(/&Eacute;/g, "É").replace(/&Ecirc;/g, "Ê").replace(/&Euml;/g, "Ë").replace(/&egrave;/g, "è").replace(/&eacute;/g, "é").replace(/&ecirc;/g, "ê").replace(/&euml;/g, "ë").replace(/&#131;/g, "ƒ").replace(/&Igrave;/g, "Ì").replace(/&Iacute;/g, "Í").replace(/&Icirc;/g, "Î").replace(/&Iuml;/g, "Ï").replace(/&igrave;/g, "ì").replace(/&iacute;/g, "í").replace(/&icirc;/g, "î").replace(/&iuml;/g, "ï").replace(/&Ntilde;/g, "Ñ").replace(/&ntilde;/g, "ñ").replace(/&Ograve;/g, "Ò").replace(/&Oacute;/g, "Ó").replace(/&Ocirc;/g, "Ô").replace(/&Otilde;/g, "Õ").replace(/&Ouml;/g, "Ö").replace(/&ograve;/g, "ò").replace(/&oacute;/g, "ó").replace(/&ocirc;/g, "ô").replace(/&otilde;/g, "õ").replace(/&ouml;/g, "ö").replace(/&Oslash;/g, "Ø").replace(/&oslash;/g, "ø").replace(/&#140;/g, "Œ").replace(/&#156;/g, "œ").replace(/&#138;/g, "Š").replace(/&#154;/g, "š").replace(/&Ugrave;/g, "Ù").replace(/&Uacute;/g, "Ú").replace(/&Ucirc;/g, "Û").replace(/&Uuml;/g, "Ü").replace(/&ugrave;/g, "ù").replace(/&uacute;/g, "ú").replace(/&ucirc;/g, "û").replace(/&uuml;/g, "ü").replace(/&#181;/g, "µ").replace(/&#215;/g, "×").replace(/&Yacute;/g, "Ý").replace(/&#159;/g, "Ÿ").replace(/&yacute;/g, "ý").replace(/&yuml;/g, "ÿ").replace(/&#176;/g, "°").replace(/&#134;/g, "†").replace(/&#135;/g, "‡").replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&#177;/g, "±").replace(/&#171;/g, "«").replace(/&#187;/g, "»").replace(/&#191;/g, "¿").replace(/&#161;/g, "¡").replace(/&#183;/g, "·").replace(/&#149;/g, "•").replace(/&#153;/g, "™").replace(/&copy;/g, 
"©").replace(/&reg;/g, "®").replace(/&#167;/g, "§").replace(/&#182;/g, "¶").replace(/&Alpha;/g, "Α").replace(/&Beta;/g, "Β").replace(/&Gamma;/g, "Γ").replace(/&Delta;/g, "Δ").replace(/&Epsilon;/g, "Ε").replace(/&Zeta;/g, "Ζ").replace(/&Eta;/g, "Η").replace(/&Theta;/g, "Θ").replace(/&Iota;/g, "Ι").replace(/&Kappa;/g, "Κ").replace(/&Lambda;/g, "Λ").replace(/&Mu;/g, "Μ").replace(/&Nu;/g, "Ν").replace(/&Xi;/g, "Ξ").replace(/&Omicron;/g, "Ο").replace(/&Pi;/g, "Π").replace(/&Rho;/g, "Ρ").replace(/&Sigma;/g, "Σ").replace(/&Tau;/g, "Τ").replace(/&Upsilon;/g, "Υ").replace(/&Phi;/g, "Φ").replace(/&Chi;/g, "Χ").replace(/&Psi;/g, "Ψ").replace(/&Omega;/g, "Ω").replace(/&alpha;/g, "α").replace(/&beta;/g, "β").replace(/&gamma;/g, "γ").replace(/&delta;/g, "δ").replace(/&epsilon;/g, "ε").replace(/&zeta;/g, "ζ").replace(/&eta;/g, "η").replace(/&theta;/g, "θ").replace(/&iota;/g, "ι").replace(/&kappa;/g, "κ").replace(/&lambda;/g, "λ").replace(/&mu;/g, "μ").replace(/&nu;/g, "ν").replace(/&xi;/g, "ξ").replace(/&omicron;/g, "ο").replace(/&piρ;/g, "ρ").replace(/&rho;/g, "ς").replace(/&sigmaf;/g, "ς").replace(/&sigma;/g, "σ").replace(/&tau;/g, "τ").replace(/&phi;/g, "φ").replace(/&chi;/g, "χ").replace(/&psi;/g, "ψ").replace(/&omega;/g, "ω").replace(/&bull;/g, "•").replace(/&hellip;/g, "…").replace(/&prime;/g, "′").replace(/&Prime;/g, "″").replace(/&oline;/g, "‾").replace(/&frasl;/g, "⁄").replace(/&weierp;/g, "℘").replace(/&image;/g, "ℑ").replace(/&real;/g, "ℜ").replace(/&trade;/g, "™").replace(/&alefsym;/g, "ℵ").replace(/&larr;/g, "←").replace(/&uarr;/g, "↑").replace(/&rarr;/g, "→").replace(/&darr;/g, "↓").replace(/&barr;/g, "↔").replace(/&crarr;/g, "↵").replace(/&lArr;/g, "⇐").replace(/&uArr;/g, "⇑").replace(/&rArr;/g, "⇒").replace(/&dArr;/g, "⇓").replace(/&hArr;/g, "⇔").replace(/&forall;/g, "∀").replace(/&part;/g, "∂").replace(/&exist;/g, "∃").replace(/&empty;/g, "∅").replace(/&nabla;/g, "∇").replace(/&isin;/g, "∈").replace(/&notin;/g, "∉").replace(/&ni;/g, "∋").replace(/&prod;/g, 
"∏").replace(/&sum;/g, "∑").replace(/&minus;/g, "−").replace(/&lowast;/g, "∗").replace(/&radic;/g, "√").replace(/&prop;/g, "∝").replace(/&infin;/g, "∞").replace(/&OEig;/g, "Œ").replace(/&oelig;/g, "œ").replace(/&Yuml;/g, "Ÿ").replace(/&spades;/g, "♠").replace(/&clubs;/g, "♣").replace(/&hearts;/g, "♥").replace(/&diams;/g, "♦").replace(/&thetasym;/g, "ϑ").replace(/&upsih;/g, "ϒ").replace(/&piv;/g, "ϖ").replace(/&Scaron;/g, "Š").replace(/&scaron;/g, "š").replace(/&ang;/g, "∠").replace(/&and;/g, "∧").replace(/&or;/g, "∨").replace(/&cap;/g, "∩").replace(/&cup;/g, "∪").replace(/&int;/g, "∫").replace(/&there4;/g, "∴").replace(/&sim;/g, "∼").replace(/&cong;/g, "≅").replace(/&asymp;/g, "≈").replace(/&ne;/g, "≠").replace(/&equiv;/g, "≡").replace(/&le;/g, "≤").replace(/&ge;/g, "≥").replace(/&sub;/g, "⊂").replace(/&sup;/g, "⊃").replace(/&nsub;/g, "⊄").replace(/&sube;/g, "⊆").replace(/&supe;/g, "⊇").replace(/&oplus;/g, "⊕").replace(/&otimes;/g, "⊗").replace(/&perp;/g, "⊥").replace(/&sdot;/g, "⋅").replace(/&lcell;/g, "⌈").replace(/&rcell;/g, "⌉").replace(/&lfloor;/g, "⌊").replace(/&rfloor;/g, "⌋").replace(/&lang;/g, "⟨").replace(/&rang;/g, "⟩").replace(/&loz;/g, "◊").replace(/&#039;/g, "'").replace(/&amp;/g, "&").replace(/&quot;/g, "\"");
}

Used like so:

let decodedText = removeEncoding("Ich hei&szlig;e David");
console.log(decodedText);

Prints: Ich heiße David

P.S. this took like an hour and a half to make.
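One thing to watch with long `.replace()` chains is ordering: each call rescans the output of the previous one, so an input can end up decoded twice if `&amp;` is handled before other entities. A possible alternative, sketched below under the assumption that a small lookup table is enough for your data, is to match every entity in a single pass. `ENTITY_TABLE` and `decodeNamedEntities` are hypothetical names, and the table is deliberately tiny.

```javascript
// Hypothetical, deliberately incomplete table; extend it with the names you need.
const ENTITY_TABLE = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'],
  ['szlig', 'ß'], ['pi', 'π'], ['rho', 'ρ'], ['harr', '↔'],
]);

function decodeNamedEntities(str) {
  // One pass: the replacement text is never rescanned, so "&amp;lt;" becomes
  // "&lt;" rather than "<". Unknown names are returned unchanged.
  return str.replace(/&([a-z][a-z0-9]*);/gi, (match, name) =>
    ENTITY_TABLE.has(name) ? ENTITY_TABLE.get(name) : match
  );
}

console.log(decodeNamedEntities('Ich hei&szlig;e David')); // Ich heiße David
console.log(decodeNamedEntities('&amp;lt; &unknown;'));    // &lt; &unknown;
```

A table also keeps each entity name in exactly one place, which makes typos in names easier to spot than in a long chain of regular expressions.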

David Chopin
0

This is the most comprehensive solution I've tried so far:

const STANDARD_HTML_ENTITIES = {
    nbsp: String.fromCharCode(160),
    amp: "&",
    quot: '"',
    lt: "<",
    gt: ">"
};

const replaceHtmlEntities = plainTextString => {
    return plainTextString
        .replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
        .replace(
            /&(nbsp|amp|quot|lt|gt);/g,
            (a, b) => STANDARD_HTML_ENTITIES[b]
        );
};
Daniel
  • 2
    "The most comprehensive"? Have you tried running it against an [actually comprehensive test suite](https://github.com/mathiasbynens/he#he---)? – Dan Dascalescu Jul 01 '20 at 01:20
0

Closures can avoid creating unnecessary objects.

const decodingHandler = (() => {
  const element = document.createElement('div');
  return text => {
    element.innerHTML = text;
    return element.textContent;
  };
})();

A more concise way

const decodingHandler = (() => {
  const element = document.createElement('div');
  return text => ((element.innerHTML = text), element.textContent);
})();
weiya ou
  • Wouldn't `innerHTML` introduce an XSS vulnerability here, as the string is being passed into it? Better to use `innerText`. – shwz May 19 '22 at 07:13
0

Use Dentity! I found none of the answers above satisfying, so I cherry picked some stuff from here, fixed their problems and added the complete W3C entity definitions, and some more functionality. I also made it as small as possible, which is now 31KB minified and 14KB when gzipped. You can download it from https://github.com/arashkazemi/dentity

It includes both the decoder and encoder functions and it works both in browser and in node environment. I hope it solves the problem efficiently!

arashka
-1

I use this in my project. It is inspired by other answers but adds an extra safeEscape parameter, which can be useful when you deal with decorated characters:

var decodeEntities=(function(){

    var el=document.createElement('div');
    return function(str, safeEscape){

        if(str && typeof str === 'string'){

            str=str.replace(/\</g, '&lt;');

            el.innerHTML=str;
            if(el.innerText){

                str=el.innerText;
                el.innerText='';
            }
            else if(el.textContent){

                str=el.textContent;
                el.textContent='';
            }

            if(safeEscape)
                str=str.replace(/\</g, '&lt;');
        }
        return str;
    }
})();

And it's usable like:

var label='safe <b> character &eacute;ntity</b>';
var safehtml='<div title="'+decodeEntities(label)+'">'+decodeEntities(label, true)+'</div>';
tmx976
-1
var encodedStr = 'hello &amp; world';

var parser = new DOMParser;
var dom = parser.parseFromString(
    '<!doctype html><body>' + encodedStr,
    'text/html');
var decodedString = dom.body.textContent;

console.log(decodedString);
jagjeet
  • 2
    @Wladimir Palant (author of AdBlock Plus) already gave the DOMParser answer [4 years](https://stackoverflow.com/questions/1912501/unescape-html-entities-in-javascript/34064434#34064434) earlier. Have you read the previous answers before posting yours? – Dan Dascalescu Jul 01 '20 at 01:23
-1
// decode-html.js v1
function decodeHtml(html) {
    const textarea = document.createElement('textarea');
    textarea.innerHTML = html;
    const decodedHtml = textarea.textContent;
    textarea.remove();
    return decodedHtml;
};

// encode-html.js v1
function encodeHtml(html) {
    const textarea = document.createElement('textarea');
    textarea.textContent = html;
    const encodedHtml = textarea.innerHTML;
    textarea.remove();
    return encodedHtml;
};

// example of use:
let htmlDecoded = 'one & two & three';
let htmlEncoded = 'one &amp; two &amp; three';

console.log(1, htmlDecoded);
console.log(2, encodeHtml(htmlDecoded));

console.log(3, htmlEncoded);
console.log(4, decodeHtml(htmlEncoded));
Luis Lobo
-2

All of the other answers here have problems.

The document.createElement('div') methods (including those using jQuery) execute any JavaScript passed into them (a security issue), and the DOMParser.parseFromString() method trims whitespace. Here is a pure JavaScript solution that has neither problem:

function htmlDecode(html) {
    var textarea = document.createElement("textarea");
    html= html.replace(/\r/g, String.fromCharCode(0xe000)); // Replace "\r" with reserved unicode character.
    textarea.innerHTML = html;
    var result = textarea.value;
    return result.replace(new RegExp(String.fromCharCode(0xe000), 'g'), '\r');
}

TextArea is used specifically to avoid executing JS code. It passes these:

htmlDecode('&lt;&amp;&nbsp;&gt;'); // returns "<& >" with non-breaking space.
htmlDecode('  '); // returns "  "
htmlDecode('<img src="dummy" onerror="alert(\'xss\')">'); // Does not execute alert()
htmlDecode('\r\n') // returns "\r\n", doesn't lose the \r like other solutions.
EricP
  • 1
    No, using a different tag does **not** solve the issue. This is still an XSS vulnerability, try `htmlDecode("")`. You posted this after I already pointed out this issue on the answer by Sergio Belevskij. – Wladimir Palant Sep 18 '18 at 07:34
  • I'm unable to reproduce the issue you describe. I have your code in this JsFiddle, and no alert displays when running. http://jsfiddle.net/edsjt15g/1/ Can you take a look? What browser are you using? – EricP Sep 19 '18 at 17:19
  • 2
    I'm using Firefox. Chrome indeed handles this scenario differently, so the code doesn't execute - not something you should rely on however. – Wladimir Palant Sep 19 '18 at 17:30
-2

function decodeHTMLContent(htmlText) {
  var txt = document.createElement("span");
  txt.innerHTML = htmlText;
  return txt.innerText;
}

var result = decodeHTMLContent('One &amp; two &amp; three');
console.log(result);
nand-63
  • How is this answer better than the `textarea` one given *years* ago? – Dan Dascalescu Jul 01 '20 at 01:16
  • This _will_ present a security issue. There's nothing stopping you from adding an `` into that and running arbitrary JS. **Do not use this or anything similar to it in production (or for a hobby project, if others will use it).** – Radvylf Programs Oct 09 '21 at 21:05
-3

The current top-voted answer has the disadvantage of removing HTML from a string. If that isn't what you want (it certainly wasn't part of the question), then I suggest using a regex to find HTML entities (/&[^;]*;/gmi), then iterating through the matches and converting just those.

function decodeHTMLEntities(str) {
  if (typeof str !== 'string') {
    return false;
  }
  var element = document.createElement('div');
  return str.replace(/&[^;]*;/gmi, entity => {
    entity = entity.replace(/</gm, '&lt;');
    element.innerHTML = entity;
    return element.textContent;
  });
}

var encoded_str = `<b>&#8593; &#67;&#65;&#78;'&#84;&nbsp;&#72;&#65;&#67;&#75;&nbsp;&#77;&#69;,&nbsp;&#66;&#82;&#79;</b>`;
var decoded_str = decodeHTMLEntities(encoded_str);

console.log(decoded_str);

Regarding XSS Attacks:

While innerHTML does not execute code in <script> tags, it is possible for code to be run in on* event attributes, so the above regex alone might be exploitable by a user passing in a string such as:

&<img src='asdfa' error='alert(`doin\' me a hack`)' />;

For that reason, it is necessary to convert any < characters to their &lt; before putting them in your hidden div element.

Also, just to cover all of my bases, I'll note that functions defined like this in the global scope can be overwritten by redefining them in the console, so it is important to either define this function with const or put it in a non-global scope.

Note: The attempted exploits in the following example confuse the stack snippet editor because of the preprocessing it does, so you'll have to run it in the browser's console, or in its own file, to see the result.

var tests = [
  "here's a spade: &spades;!",
  "&<script>alert('hackerman')</script>;",
  "&<img src='asdfa' error='alert(`doin\' me a hack`)' />;",
  "<b>&#8593; &#67;&#65;&#78;'&#84;&nbsp;&#72;&#65;&#67;&#75;&nbsp;&#77;&#69;,&nbsp;&#66;&#82;&#79;</b>"
];

var decoded = tests.map(decodeHTMLEntities).join("\n");
console.log(decoded);

The result is:

here's a spade: ♠!
&<script>alert('hackerman')</script>;
&<img src='asdfa' error='alert(`doin' me a hack`)' />;
<b>↑ CAN'T HACK ME, BRO</b>
I wrestled a bear once.
  • This fails to prevent XSS attacks. You specifically reference the "top voted answer", but your answer fails to handle the simple XSS attack outlined in that answer. Simply rejecting ` – user229044 Jan 10 '23 at 15:19
  • Actually there is lots more wrong here. This fails to even reject ` – user229044 Jan 10 '23 at 15:31
  • I'm really not sure what you mean. Is your intent not to leave `<script` alone, so that it remains encoded? Regardless, you're still doing this: `element.innerHTML = '<'`, even when your code is fed something like `<script`, so what is the `.includes('SCRIPT')` supposed to prevent here? `entity` only ever contains the `&..;` stuff, ie `<`. It can never include the word `script`, unless you have something like `&script;` which is nonsense. In any case, your code returns corrupted output when the input contains script tags `decodeHTMLEntities('<script>') // '\x3Cscript>'` – user229044 Jan 10 '23 at 20:10
  • @user229044 - That's not what I'm getting when I run the code in the console on Firefox. It doesn't work on any of the online code editors I tried because of the preprocessing they do, but when you run the code as is (`decodeHTMLEntities('<script>')`) it produces `;')` from being executed, since, as far as I know, there aren't any html character codes that include the word "script". – I wrestled a bear once. Jan 10 '23 at 20:17
  • I should also note that I've been using this code in production on a site that gets several thousand new unique visitors a day for about a week and have had no issues. – I wrestled a bear once. Jan 10 '23 at 20:20
  • 1
    XSS is certainly possible with `innerHTML`, again, refer to the top-voted answer's example of an XSS payload: `document.createElement('div').innerHTML=(\`\`)` – user229044 Jan 10 '23 at 20:53
  • Fixed it, what else am I missing, @user229044? – I wrestled a bear once. Jan 11 '23 at 16:15
-8

There is a variant that is 80% as fast as the answers at the very top.

See the benchmark: https://jsperf.com/decode-html12345678/1

performance test

console.log(decodeEntities('test: &gt'));

function decodeEntities(str) {
  // this prevents any overhead from creating the object each time
  // (the element is stored on the function so later calls can reuse it)
  const el = decodeEntities.element
    || (decodeEntities.element = document.createElement('textarea'));

  // strip script/html tags
  el.innerHTML = str
    .replace(/<script[^>]*>([\S\s]*?)<\/script>/gmi, '')
    .replace(/<\/?\w(?:[^"'>]|"[^"]*"|'[^']*')*>/gmi, '');

  return el.value;
}

If you need to leave tags, then remove the two .replace(...) calls (you can leave the first one if you do not need scripts).

  • 11
    Congratulations, you managed to obscure the vulnerability with bogus sanitization logic, all for a performance win that won't matter in practice. Try calling `decodeEntities("")` in Firefox. Please stop attempting to sanitize HTML code with regular expressions. – Wladimir Palant Mar 14 '19 at 10:01
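
To see why regex-based stripping is fragile, consider the two `.replace(...)` patterns from the answer above applied to a crafted input. The input below is an illustrative one chosen for this sketch, not the payload from the comment, and `stripTags` is a hypothetical wrapper around the answer's two expressions. Each pattern runs only once over the string, so removing an inner decoy tag can assemble a new tag that is never re-checked.

```javascript
// Hypothetical wrapper around the two regular expressions used in the answer.
function stripTags(str) {
  return str
    .replace(/<script[^>]*>([\S\s]*?)<\/script>/gmi, '')
    .replace(/<\/?\w(?:[^"'>]|"[^"]*"|'[^']*')*>/gmi, '');
}

// The decoy "<x>" tags are removed, and what is left behind forms real tags.
const crafted = '<<x>/textarea><<x>img src=x onerror=alert(1)>';
console.log(stripTags(crafted));
// </textarea><img src=x onerror=alert(1)>
```

Because markup survives the stripping step, the regex pass cannot be relied on as a safety measure before assigning to `innerHTML`.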