If you have a string containing HTML entities and want to unescape it, this solution (or variants thereof) is suggested multiple times:

function htmlDecode(input){
  var e = document.createElement('div');
  e.innerHTML = input;
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

htmlDecode("&lt;img src='myimage.jpg'&gt;");
// returns "<img src='myimage.jpg'>"

(See, for example, this answer: https://stackoverflow.com/a/1912522/1199564)

This works fine unless the string contains a newline and the code is running on Internet Explorer older than version 10 (tested on versions 8 and 9).

If the string contains a newline, IE 8 and 9 will replace it with a space character instead of leaving it unchanged (as it is on Chrome, Safari, Firefox and IE 10).

htmlDecode("Hello\nWorld"); 
// returns "Hello World" on IE 8 and 9

Any suggestions for a solution that works with IE before version 10?

mgd

1 Answer

The simplest, though probably not the most efficient, solution is to have htmlDecode() act only on character and entity references:

var s = "foo\n&amp;\nbar";
s = s.replace(/(&[^;]+;)+/g, htmlDecode);
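
To see why this leaves newlines alone, here is a minimal sketch that runs without a browser. The names `NAMED` and `decodeRun` are hypothetical stand-ins for the DOM-based htmlDecode() and only know a handful of named entities. The point is that the regular expression hands only runs of references to the decoder, so everything else in the string, newlines included, is never touched:

```javascript
// Hypothetical stand-in for the DOM-based htmlDecode(); it only knows
// a few named entities and exists so the sketch runs outside a browser.
var NAMED = { amp: "&", lt: "<", gt: ">", quot: '"' };

function decodeRun(run) {
  return run.replace(/&(\w+);/g, function (ref, name) {
    // Unknown names are returned unchanged.
    return NAMED.hasOwnProperty(name) ? NAMED[name] : ref;
  });
}

var input = "foo\n&amp;&lt;\nbar";

// Only the run "&amp;&lt;" is passed to the decoder; the newlines around
// it are outside the match and are copied through as they are.
var decoded = input.replace(/(&[^;]+;)+/g, function (match) {
  return decodeRun(match);
});
```

In a browser, the callback would call htmlDecode(match) instead of the stand-in.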

A more efficient option is an optimized rewrite of htmlDecode() that is called only once per input, acts only on character and entity references, and reuses the DOM element object:

function htmlDecode (input)
{
  var e = document.createElement("span");

  var result = input.replace(/(&[^;]+;)+/g, function (match) {
    e.innerHTML = match;
    return e.firstChild.nodeValue;
  });

  return result;
}

/* returns "foo\n&\nbar" */
htmlDecode("foo\n&amp;\nbar");

Wladimir Palant has pointed out an XSS issue with this function: the value of some (HTML5) event listener attributes, such as onerror, is executed if you assign HTML containing elements with those attributes to the innerHTML property. So you should not use this function on arbitrary input containing actual HTML, only on HTML that is already escaped. Otherwise, adapt the regular expression accordingly; for example, use /(&[^;<>]+;)+/ instead, which prevents &…; from being matched where … contains tags.

For arbitrary HTML, please see his alternative approach, but note that it is not as compatible as this one.
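
The difference between the two patterns mentioned in the caveat can be checked without a DOM. This is a small sketch; the `hostile` and `escaped` sample strings are made up for illustration and only the matching is exercised, nothing is assigned to innerHTML:

```javascript
var loose = /(&[^;]+;)+/g;     // pattern used in the function above
var strict = /(&[^;<>]+;)+/g;  // adapted pattern from the caveat

// Made-up input: a tag with an event handler wrapped in "&" ... ";".
var hostile = "&<img src=x onerror=alert(1)>;";
// Made-up input: ordinary, already-escaped markup.
var escaped = "a &lt;b&gt; c";

// The loose pattern matches the whole hostile string, so the tag would
// be handed to innerHTML by the function above.
var looseHits = hostile.match(loose);

// The strict pattern refuses "<" and ">" inside a reference, so nothing matches.
var strictHits = hostile.match(strict);

// Escaped input is still matched reference by reference.
var escapedHits = escaped.match(strict);
```

With the strict pattern, the hostile string is returned unchanged by String.prototype.replace, because there is no match to replace.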

PointedEars
  • 1
    Thank you. Works like a charm. I would suggest you edit the sample string "foo &amp; bar" to include "\n" characters like this "foo\n&amp;\nbar" to illustrate that the code handles newlines correctly. Also, could you please explain why 'e' is involved in a circular reference. – mgd Sep 25 '12 at 15:02
  • Makes sense. I would like to accept your edit to give you the credit, but I do not know how as it is the first time someone edited one of my answers. I can only see two spurious(?) rejects and no "Accept" button :-/ – PointedEars Sep 25 '12 at 16:18
  • @mgd Apparently one cannot approve already rejected edits, so I have applied your edit and +1 for the comment. – PointedEars Sep 25 '12 at 16:32
  • Why is `e` a circular reference? It's used in the anonymous function but when the code returns, references to both are forgotten, so GC should be happy :-/ – Aaron Digulla Sep 25 '12 at 17:09
  • @PointedEars My edit might have been rejected because I had forgotten to insert a change comment. (embarrassing) – mgd Sep 25 '12 at 19:50
  • @AaronDigulla I see that the anon function references `e` but I don't see where the circularity is. If `a` references `b` which references `a` we have a circular reference. But why in this case? And why does it help to `null` `e` which goes out of scope right after the function returns? – mgd Sep 25 '12 at 20:31
  • 2
    Circular reference (CMIIW): `e` → `e.ownerDocument` → (`e.ownerDocument.defaultView` === `window`) → `window.htmlDecode` → `window.htmlDecode.[[Scope]]` → `e`. The problem is that (older) JScript's GC cannot clean that up even though `e` goes out of scope. See also: [Understanding and Solving Internet Explorer Leak Patterns](http://msdn.microsoft.com/en-us/library/bb250448(VS.85).aspx) – PointedEars Sep 25 '12 at 23:58
  • 1
    @PointedEars: Learned something new. Thanks! – Aaron Digulla Sep 26 '12 at 08:35
  • @AaronDigulla You are welcome. And thank you for your edit, I have adapted it only slightly. – PointedEars Sep 26 '12 at 09:20
  • @PointedEars: Consider this illegally encoded string `var s = "x\n&amp\ny\n&amp;\nz";`. The pattern your algorithm uses will match from the first `&` to the last `;` which means it will include two `\n` characters and will not work correctly on IE 8 and 9. Changing the pattern to `/(&[^&;]+;)+/g` will make it match from the last `&`. Now it works again on IE 8 and 9. However, on (at least) Chrome `&amp\n` is normally incorrectly (?) unescaped to `&` whereas this algorithm using the new pattern will leave it untouched which I guess is correct. I would suggest changing the pattern in your answer. – mgd Sep 26 '12 at 11:41
  • @PointedEars: Tried to make an edit to the question changing the pattern (two locations) and adding an example `"foo\n&amp\nbar\n&amp;\nbaz"` at the end illustrating the point. Apparently, the edit has been rejected (at least it doesn't show up in my browser as it used to do until accepted). – mgd Sep 26 '12 at 12:33
  • @PointedEars: Regarding circular references, IMHO the circular chain is incorrect in this location: `window.htmlDecode` → `window.htmlDecode.[[Scope]]`. Every time the function is called a new scope is created and the reference goes from the scope to the function. Also, the closure (anon function capturing `e`) also goes out of scope when the function returns. Therefore, the `null`ing is not necessary. – mgd Sep 26 '12 at 13:03
  • @mgd Please substantiate your claim about `[[Scope]]`. I can see nothing in ECMA-262-5.1 §13.2.1 to suggest that. In fact, I think if it was as you suggest, closures would not work. I will look into `e` later. – PointedEars Sep 26 '12 at 19:10
  • @mgd You have to draw the line somewhere. This function is not intended to process invalid markup, and I will not update it or approve changes to it so that it can (it was not me who rejected your edits, though). Feel free to do that in your code, and carefully consider the complexities before. However, according to [SGML](http://www.w3.org/MarkUp/SGML/productions.html#prod59), entity references are terminated either by `;` or RE. And at least for HTML 4.01, [`RE` is defined to be what is matched by `/\r/`](http://www.w3.org/TR/html401/HTML4.decl). So I have updated the expression accordingly. – PointedEars Sep 26 '12 at 19:30
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/17232/discussion-between-mgd-and-pointedears) – mgd Sep 27 '12 at 10:55
  • Please consider restricting the characters accepted by your regular expression so that it won't match HTML tags, e.g. `/(&[^;<>\s]+;)+/`. Then your `htmlDecode` function will be safe to use even on unsafe input, something that cannot be said about the original function (see [my answer there](https://stackoverflow.com/a/34064434/785541)). – Wladimir Palant Feb 25 '16 at 17:41
  • @WladimirPalant Thank you for bringing the XSS issue to my attention. I consider this a security bug in DOM implementations: No client-side script code should be executed before the element has been added to the document tree. However, although markup like `&;` is Valid HTML_5_ (but _not_ Valid HTML _4.01_), it is an extreme edge case. Ignoring HTML tags to consider that then, would defeat the purpose of the function. And I am not ready to build a full HTML parser into it. So, after careful consideration, I have to reject your suggestion. – PointedEars Feb 28 '16 at 07:33
  • @WladimirPalant I have added a caveat instead. If one has the choice, one should prefer the approach you present in your answer instead. – PointedEars Feb 28 '16 at 08:40
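
The matching behaviour discussed in the comments (an unterminated reference followed by a terminated one) can be reproduced with the regular expressions alone. This is a sketch; the sample string is an assumption modelled on the one in mgd's comment, and no DOM decoding is performed:

```javascript
// "&amp" lacks its semicolon; the later "&amp;" is well formed.
var sample = "x\n&amp\ny\n&amp;\nz";

var original = /(&[^;]+;)+/g;    // pattern from the answer
var narrowed = /(&[^&;]+;)+/g;   // pattern suggested in the comments

// The original pattern runs from the first "&" to the first ";",
// swallowing two newlines on the way.
var originalHits = sample.match(original);

// Excluding "&" inside a reference makes the match start at the last "&",
// so no newline is ever handed to the decoder.
var narrowedHits = sample.match(narrowed);
```

The narrowed pattern leaves the unterminated `&amp` untouched, which is the trade-off described in the comments.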