How to convert unicode characters to HTML numeric entities using plain Javascript

Question

I'm trying to convert innerHTML with special characters into their original &#...; entity values but can't seem to get it working for unicode values. Where am I going wrong?

The code is trying to take "Orig" - encode it and place it into "Copy"....

Orig: 1:__2:__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:

Copy: 1:�__2:�__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:�

... but obviously the dreaded black diamonds are appearing!

if (!String.prototype.codePointAt) {
  String.prototype.codePointAt = function(pos) {
    pos = isNaN(pos) ? 0 : pos;
    var str = String(this),
      code = str.charCodeAt(pos),
      next = str.charCodeAt(pos + 1);
    // If a surrogate pair
    if (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) {
      return ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000;
    }
    return code;
  };
}

/**
 * Encodes special html characters
 * @param string
 * @return {*}
 */
function html_encode(s) {
  var ret_val = '';
  for (var i = 0; i < s.length; i++) {
    if (s.codePointAt(i) > 127) {
      ret_val += '&#' + s.codePointAt(i) + ';';
    } else {
      ret_val += s.charAt(i);
    }
  }
  return ret_val;

}

var v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;
//console.log(v);

div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}

Orig:<div id='orig'>1:__2:__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>

(A jsfiddle is available at https://jsfiddle.net/Abeeee/k6e4svqa/24/)

I've been through the various suggestions on How to convert characters to HTML entities using plain JavaScript, including the he.js which looks the most favourable, but when I downloaded that script it doesn't compile (Unexpected Token around line 32: .. var encodeMap = <%= encodeMap %>;).

I'm not sure where to go with this.

But _why_ would you need to do this? Just make sure your HTML file is saved as a UTF-8 document (which will almost certainly already be the case if you use any of the even mildly popular modern text/code editors), and make sure it contains `<meta charset="utf-8">` so the browser renders it correctly. — Mike 'Pomax' Kamermans, Oct 15 '21 at 16:43
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt — Teemu, Oct 15 '21 at 16:45
Not sure how these comments help. Mike - if you try https://jsfiddle.net/Abeeee/k6e4svqa/28/ and simply copy and paste a smiley face onto the end of the "Orig" field then the problem continues - with the meta tag in place. Teemu ... why are you showing this link? — user1432181, Oct 15 '21 at 16:59

Old Geezer · Accepted Answer · 2021-10-16T01:41:08.553

JavaScript strings are UTF-16. A character outside the Basic Multilingual Plane (code point above 0xFFFF) is stored as a surrogate pair, which takes up two 16-bit code units. The length property of a string counts 16-bit code units, not characters. Thus "🙂".length returns 2.

codePointAt(i) does not address the ith character, but the ith 16-bit code unit. Hence, a character stored as a surrogate pair is seen by two consecutive codePointAt invocations. Per the spec, if the code unit at the given index is a high surrogate followed by a low surrogate, the function returns the full code point value, i.e. "🙂".codePointAt(0) returns 128578, but "🙂".codePointAt(1) returns only the low surrogate 56898, which renders as that black diamond.
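
The numbers quoted above can be checked directly in a console. This is only a small sketch; the variable names are made up for illustration, and the smiley is written as an escape (U+1F642, decimal 128578) so the snippet survives copy and paste.

```javascript
// U+1F642 lies outside the Basic Multilingual Plane,
// so UTF-16 stores it as two code units (a surrogate pair).
const smiley = "\u{1F642}";

// length counts UTF-16 code units, not characters.
const unitCount = smiley.length; // 2

// Index 0 holds the high surrogate, so the pair is combined.
const fullCodePoint = smiley.codePointAt(0); // 128578

// Index 1 holds the low surrogate (0xDE42) on its own.
const loneLowSurrogate = smiley.codePointAt(1); // 56898

// String iteration walks code points, so spreading gives one element.
const charCount = [...smiley].length; // 1

console.log(unitCount, fullCodePoint, loneLowSurrogate, charCount);
// prints: 2 128578 56898 1
```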

Thus you need to skip one position if codePointAt returns a high surrogate.
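
A minimal sketch of that index-based skip, assuming the same output format as the question's html_encode. The name htmlEncodeByIndex is hypothetical and only used here for illustration.

```javascript
// Index-based variant: walk UTF-16 code units, but step over the
// low surrogate whenever codePointAt() has already consumed a pair.
function htmlEncodeByIndex(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const code = s.codePointAt(i);
    if (code > 127) {
      out += '&#' + code + ';';
      // A value above 0xFFFF means index i held a high surrogate and
      // index i + 1 holds its low surrogate, so skip that extra unit.
      if (code > 0xFFFF) i++;
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}

console.log(htmlEncodeByIndex("a\u{1F642}b")); // a&#128578;b
```

An unpaired surrogate is returned by codePointAt as-is (a value at or below 0xFFFF), so it is encoded on its own and nothing is skipped.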

Following the example in the specs, instead of iterating through each 16-bit code unit in the string, use a method that loops through each character. for (let char of aString) {} does just that.

function html_encode(s) {
  var ret_val = '';
  for (let char of s) {
    const code = char.codePointAt(0);
    if (code > 127) {
      ret_val += '&#' + code + ';';
    } else {
      ret_val += char;
    }
  }
  return ret_val;
}

let v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;

div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}

Orig:<div id='orig'>1:__2:__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>
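
As an alternative that is not part of the answer above, the same encoding can be sketched with a regular expression. With the u flag, a character class matches whole code points, so a surrogate pair reaches the callback as one string. The name htmlEncodeRegex is hypothetical, and the snippet runs without a DOM so it can be tried in Node or a browser console.

```javascript
// Replace every non-ASCII code point with a decimal numeric entity.
// The "u" flag makes [^\x00-\x7F] match a full code point, so a
// surrogate pair is handed to the callback as a single match.
function htmlEncodeRegex(s) {
  return s.replace(/[^\x00-\x7F]/gu, (ch) => '&#' + ch.codePointAt(0) + ';');
}

// U+26A0 U+FE0F is the warning sign plus variation selector,
// U+1F642 is the smiley from the question.
console.log(htmlEncodeRegex("10:\u26A0\uFE0F__12:\u{1F642}"));
// 10:&#9888;&#65039;__12:&#128578;
```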

Thanks Old Geezer - the "if (code > 65535) i++;" did the trick. — user1432181, Oct 15 '21 at 17:20
@user1432181 I have modified the code to iterate through each *character* in the string instead of each 16-bit element. I am not sure if `codePointAt` handles endianness to ensure that the high surrogate always comes before the low surrogate. I think it does. — Old Geezer, Oct 16 '21 at 01:57
