Trouble with Unicode encoding

Question

I'm trying to encode any 'special' characters in JavaScript before processing the text, but some special Unicode characters aren't being encoded the way I want. How do I handle this?

I'd like to simply run something like encoding = html.replace(...) over the text to get a result that can be sent as standard UTF-8 (I think).

My sample HTML reads:

<div id="source" contentEditable="true">&euro; A B C D &#x1F600; &#xD83D; &#x41;</div>
<pre id="trace"></pre>

And my JavaScript reads:

function getEncoding(html) {
    return html.replace(/[\u00A0-\u9999<>\&]/gim, function(i) {
        return '&#x' + i.charCodeAt(0).toString(16) + ';';
    });
}

var _last='';
function tracer() {
    var html=document.getElementById('source').innerHTML;
    if ( html==_last ) return;
    _last=html;

    var encoding = getEncoding(html);
    var hex="";
    var n=1,c,p,i;
    for(p=0; p<html.length; p++) {
        c=html.substr(p,1);
        i=c.charCodeAt(0);

        if ( i===0x20 ) hex+='<i>space</i> ';
        else hex+='<span class="block">'
                +'char'+(n++)
                +') '+c
                +'('+c.charCodeAt(0)
                +'/#x'+c.charCodeAt(0).toString(16)
                +')</span> ';
    }

    console.log('encoding = '+encoding);
    document.getElementById('trace').innerHTML=encoding+'<HR>'+hex;
}

setInterval(tracer, 1000);

Everything encodes OK except the yellow smiley face. As you can see, I took the output (xd83d) and put it into the "source" to see what it looks like, but it shows up as a black diamond with a question mark, so it's obviously wrong. I understand the smiley face is Unicode and therefore two bytes, but I just don't know how to manage that in the replace logic (\u00A0-\u9999<>\&).

I have put this code into https://jsfiddle.net/Abeeee/uj7L38n5/2/ for a live demo.

So, how do I change getEncoding() to encode the string in order to be able to produce #x1f600 for "char6/7"?

`0xd83d,0xde00` is the surrogate pair for `U+1F600` (Grinning face)… - JosefZ, Dec 04 '20 at 17:24
Unicode codepoints above U+FFFF require surrogate pairs when encoded in UTF-16, such as for Java(script) strings. But don't encode surrogate pairs in HTML entities, use full codepoints instead, e.g. U+1F600 should be encoded as `&#x1F600;` or `&#128512;`, don't encode it as `&#xD83D;&#xDE00;` - Remy Lebeau, Dec 04 '20 at 17:35
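
Following the two comments above, one possible way to change getEncoding() is sketched here. It assumes an ES2015+ engine, where the regex `u` flag makes the pattern match whole code points and codePointAt() returns the full code point of a surrogate pair. The upper bound \u{10FFFF} is my choice rather than something from the question, and the `&` is left unescaped because `\&` is not a valid escape in `u` mode.

```javascript
// Sketch: encode <, >, & and every character from U+00A0 upward as a
// hexadecimal numeric character reference, one reference per code point.
function getEncoding(html) {
    // With the "u" flag a surrogate pair such as \uD83D\uDE00 is matched
    // as ONE character (U+1F600), so the callback sees both code units.
    return html.replace(/[\u00A0-\u{10FFFF}<>&]/gu, function (ch) {
        // codePointAt(0) combines the pair; charCodeAt(0) would only
        // return the high surrogate (0xD83D).
        return '&#x' + ch.codePointAt(0).toString(16) + ';';
    });
}

// Euro sign, plain ASCII, the grinning face, and some markup characters.
console.log(getEncoding('\u20AC A \u{1F600} <b>&'));
```

With this version the smiley should come out as a single `&#x1f600;` instead of `&#xd83d;` followed by `&#xde00;`. A lone (unpaired) surrogate in the input would still be matched on its own and produce an invalid reference, so it may need separate handling if that can occur in your data.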
