Get javascript node raw content

Question

I have a javascript node in a variable, and if I log that variable to the console, I get this:

"&#8203;asekuhfas eo"

Just some random string in a javascript node. I want to get that literally as a string. But the problem is, when I use textContent on it, I get this:

asekuhfas eo

The special character is converted. I need to get the string to appear literally like this:

&#8203;asekuhfas eo

This way, I can deal with the special character (recognize when it exists in the string).

How can I get that node object to be a string LITERALLY as it appears?

@hon2a Well I meant a DOM node, in javascript, seemed redundant to clarify. — Joel Worsham, Nov 21 '14 at 15:26
To be more on point, you might want to check out http://stackoverflow.com/questions/18749591/encode-html-entities-in-javascript and encode `innerHTML`, if you want to show the special characters. — hon2a, Nov 21 '14 at 15:28

score 3 · Accepted Answer · answered Nov 21 '14 at 15:35

As VisionN has pointed out, it is not possible to recover the original entity: the HTML parser has already decoded it into a plain character by the time you read the text. However, by using charCodeAt() you can probably still achieve your goal.

Say you have your textContent. By iterating through each character, retrieving its char code, prepending "&#" and appending ";", you can get your desired result. The obvious downside of this method is that every character ends up in this notation, even those that do not require it. By introducing some kind of threshold you can restrict it to only the exotic characters.

A very naive approach would be something like this:

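var a = div.textContent; // div is the element that holds the text
var result = "";
var threshold = 1000; // char codes above this are written as numeric entities
for (var i = 0; i < a.length; i++) {
  if (a.charCodeAt(i) > threshold)
    result += "&#" + a.charCodeAt(i) + ";";
  else
    result += a[i];
}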
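
If only a few specific characters matter, the threshold in the loop above can be swapped for an explicit list of char codes. This is a minimal sketch, not a definitive implementation: the helper name encodeOnly is made up for illustration, and a hand-written sample string stands in for div.textContent.

```javascript
// Hypothetical helper: rewrites only the code units listed in `codes`
// as decimal numeric character references and copies everything else unchanged.
function encodeOnly(text, codes) {
  var result = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    result += codes.indexOf(code) !== -1 ? "&#" + code + ";" : text[i];
  }
  return result;
}

// Assumed sample: the string from the question, with the ZWSP written as an escape.
var sample = "\u200Basekuhfas eo";
console.log(sample.indexOf("\u200B") !== -1); // true: the character is present
console.log(encodeOnly(sample, [8203]));       // "&#8203;asekuhfas eo"
```

The indexOf check alone is enough if the goal is only to recognize that the character exists in the string.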

This is especially useful considering I'm only interested in that ONE specific char code. So I can just replace `if (a.charCodeAt(i) > 1000)` with `if (a.charCodeAt(i) == 8203)` — Joel Worsham, Nov 21 '14 at 15:41

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

textContent returns everything correctly, as &#8203; is the Unicode character 'ZERO WIDTH SPACE' (U+200B), which is:

commonly abbreviated ZWSP

this character is intended for invisible word separation and for line break control; it has no width, but its presence between two characters does not prevent increased letter spacing in justification

It can be easily proven with:

var div = document.createElement('div');
div.innerHTML = '&#8203;xXx';

console.log( div.textContent );                   // "xXx"
console.log( div.textContent.length );            // 4
console.log( div.textContent[0].charCodeAt(0) );  // 8203

As Eugen Timm mentioned in his answer, it is a bit tricky to convert Unicode characters back to HTML entities, and his solution is completely valid for non-standard characters with a char code higher than 1000. As an alternative, I may propose a shorter RegExp solution which will give the same result:

var result = div.textContent.replace(/./g, function(x) {
    var code = x.charCodeAt(0);
    return code > 1e3 ? '&#' + code + ';' : x;
});

console.log( result );  // "&#8203;xXx"
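
One caveat applies to both the loop and the RegExp version: charCodeAt() works on UTF-16 code units, so a character outside the Basic Multilingual Plane would come out as two surrogate values (for example &#55357;&#56832; instead of &#128512;). A minimal sketch of a code-point-aware variant, assuming an environment with ES2015 support (for...of over strings and codePointAt); the helper name encodeCodePoints is hypothetical:

```javascript
// Hypothetical helper: encodes every code point above `threshold`
// as a decimal numeric character reference.
function encodeCodePoints(text, threshold) {
  var result = "";
  // for...of iterates a string by code point, so a surrogate pair arrives as one item.
  for (var ch of text) {
    var code = ch.codePointAt(0);
    result += code > threshold ? "&#" + code + ";" : ch;
  }
  return result;
}

console.log(encodeCodePoints("\u200BxXx", 1000));     // "&#8203;xXx"
console.log(encodeCodePoints("a\uD83D\uDE00", 1000)); // "a&#128512;"
```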

For a better solution you may have a look at this answer which can handle all HTML special characters.

not sure this answers the OP's question regarding: `How can I get that node object to be a string LITERALLY as it appears?` OP wants to fetch the string as it appears in the html (namely, the `&#8203;` bit). — ddavison, Nov 21 '14 at 15:19
Right, I get that. So I'm wondering if I can get that DOM node as it literally appears in some other way? Some sort of numeric or raw unicode encoding method that I'm not aware of perhaps? — Joel Worsham, Nov 21 '14 at 15:19
