
My current code converts characters into entities as expected. But if I convert an emoji, it generates something like ��, which doesn't render as expected.

String.prototype.toHtmlEntities = function() {
  return this.replace(/./gm, function(s) {
    // return "&#" + s.charCodeAt(0) + ";";
    return (s.match(/[a-z0-9\s]+/i)) ? s : "&#" + s.charCodeAt(0) + ";";
  });
};
console.log("".toHtmlEntities())

document.write("".toHtmlEntities())
mplungjan

1 Answer


You're iterating over the code units of your string, but you want to iterate over its code points. Most emojis consist of a single code point outside the Basic Multilingual Plane, which JavaScript strings (UTF-16) encode as two code units called a surrogate pair: one high surrogate and one low surrogate. A surrogate on its own doesn't represent a valid character, so encoding each half separately ends up with � being rendered. If you use the u (unicode) flag on your regular expression, . matches a whole code point rather than a single code unit, so the callback receives each code point in turn. You can then read its value with codePointAt(0) and encode that as an HTML entity:

String.prototype.toHtmlEntities = function() {
  return this.replace(/[^a-z0-9\s]/ugm, s => "&#" + s.codePointAt(0) + ";");
};
console.log("a".toHtmlEntities());
document.write("a".toHtmlEntities());

console.log("&".toHtmlEntities());
document.write("&".toHtmlEntities());

console.log("".toHtmlEntities()); // surrogate pair test
document.write("".toHtmlEntities());

console.log("‍‍‍".toHtmlEntities()); // ZWJ test
document.write("‍‍‍".toHtmlEntities());

console.log("❤️".toHtmlEntities()); // variation selector test
document.write("❤️".toHtmlEntities()); // variation selector test

console.log("ñ".toHtmlEntities()); // decomposed character test (length of 2)
document.write("ñ".toHtmlEntities()); // decomposed character test (length of 2)

console.log("ñ".toHtmlEntities()); // composed character (length of 1)
document.write("ñ".toHtmlEntities()); // composed character (length of 1)
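To make the code unit versus code point distinction concrete, here is a small sketch. It uses U+1F600 as an arbitrary example emoji (my choice, written as an escape so it doesn't depend on file encoding) and compares what charCodeAt and codePointAt report:

```javascript
// U+1F600 is an arbitrary example emoji, written as an escape sequence
const emoji = "\u{1F600}";

// split("") cuts the string at every UTF-16 code unit
const codeUnits = emoji.split("").map(u => u.charCodeAt(0));

// Array.from uses the string iterator, which walks whole code points
const codePoints = Array.from(emoji, c => c.codePointAt(0));

console.log(emoji.length); // 2
console.log(codeUnits);    // [ 55357, 56832 ]  (high and low surrogate)
console.log(codePoints);   // [ 128512 ]        (0x1F600)
```

Encoding the two code units separately gives &#55357;&#56832;, which is two lone surrogates, while &#128512; refers to the actual character.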

If you just want to replace the emoji characters, you can use /\p{RGI_Emoji}/gv. If you can't rely on that yet, as v is a relatively new flag, you can use \p{Emoji_Presentation} or \p{Emoji} with the u flag instead (or another regular expression that matches your specific characters), and replace the matches with their code points. Note that \p{Emoji} also matches the digits 0-9, # and *. For example:

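String.prototype.toHtmlEntities = function() {
  // \p{RGI_Emoji} can match a sequence of several code points (ZWJ sequences,
  // variation selectors), so encode every code point in the match, not just the first
  return this.replace(/\p{RGI_Emoji}/vgm, s => Array.from(s, c => "&#" + c.codePointAt(0) + ";").join(""));
};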
console.log("a".toHtmlEntities());
document.write("a".toHtmlEntities());

console.log("&".toHtmlEntities());
document.write("&".toHtmlEntities());

console.log("".toHtmlEntities()); // surrogate pair test
document.write("".toHtmlEntities());

console.log("‍‍‍".toHtmlEntities()); // ZWJ test
document.write("‍‍‍".toHtmlEntities());

console.log("❤️".toHtmlEntities()); // variation selector test
document.write("❤️".toHtmlEntities()); // variation selector test

console.log("ñ".toHtmlEntities()); // decomposed character test (length of 2)
document.write("ñ".toHtmlEntities()); // decomposed character test (length of 2)

console.log("ñ".toHtmlEntities()); // composed character (length of 1)
document.write("ñ".toHtmlEntities()); // composed character (length of 1)
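For environments without the v flag, the \p{Emoji_Presentation} fallback might look like the sketch below. The helper name emojiToEntities is hypothetical and not part of the answer. It only handles characters that default to emoji presentation, so a text-default symbol such as U+2764 followed by a variation selector is left untouched:

```javascript
// Hypothetical helper (name is an assumption); relies only on the u flag
function emojiToEntities(str) {
  // Each match is a single code point, so codePointAt(0) covers the whole match
  return str.replace(/\p{Emoji_Presentation}/gu, s => "&#" + s.codePointAt(0) + ";");
}

console.log(emojiToEntities("a&\u{1F600}")); // a&&#128512;

// \p{Emoji} is broader than it sounds: it also matches plain digits
console.log(/\p{Emoji}/u.test("1"));              // true
console.log(/\p{Emoji_Presentation}/u.test("1")); // false
```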

As always, if you're going to modify the prototype of built-in JavaScript objects, make sure you understand the consequences of doing so. The recommended alternative is to create a standalone function and pass the string you want to convert into it as an argument.
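A minimal sketch of that standalone variant, reusing the regular expression from the first snippet (the function name simply mirrors the prototype method and is an arbitrary choice):

```javascript
// Standalone version: takes the string as an argument instead of using `this`
function toHtmlEntities(str) {
  // The u flag makes each match a whole code point rather than a code unit
  return String(str).replace(/[^a-z0-9\s]/gu, s => "&#" + s.codePointAt(0) + ";");
}

console.log(toHtmlEntities("a&\u{1F600}")); // a&#38;&#128512;
```

String.prototype stays untouched, so the helper can't collide with other scripts or with future additions to the language.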

Nick Parsons