9

I have an application that allows users to persist strings to a database, and those strings may contain emojis. The problem I have is that an emoji such as 😊 will get stored in MySQL as ðŸ˜Š.

When I retrieve this string using a PHP MySQL client and render it in a web browser, it renders fine, probably because the Content-Type is set to UTF-8. When I try to read the string in node.js, I get back what I think is the ISO-8859-1 encoding: a literal ðŸ˜Š. The charset on the table is set to latin1, and that's where I'm getting ISO-8859-1 from.

What's the right way to encode the string in node.js so that I can see the emoji and not the encoding set by MySQL when I console.log the string?

randombits
  • How about simply fixing the column charset instead? – zerkms Mar 30 '17 at 01:53
  • @zerkms Unfortunately that's not an option at this time, ideally I'd be able to solve this on the client. I do understand that your suggestion is the ideal solution, however. – randombits Mar 30 '17 at 01:59
  • It looks like you need to convert the unicode codepoints to the corresponding iso8859-1 characters. Then use the array buffer to assemble a utf string from raw bytes. The latter can be found at http://stackoverflow.com/a/11058858/251311 and for the former try https://github.com/ashtuchkin/iconv-lite Alternatively to iconv-lite, you might simply create the conversion table manually, since it's tiny – zerkms Mar 30 '17 at 02:08
  • @zerkms - ISO8859-1 has no codes for Emoji, so it cannot be shoehorned into that charset. – Rick James Apr 01 '17 at 18:18
  • @RickJames I did not say it has; my answer does the same as what you suggested in mysql, but in js – zerkms Apr 01 '17 at 21:06

6 Answers

13

ðŸ˜Š is Mojibake for 😊. Interpreting the former as latin1, you get hex F09F988A, which is the UTF-8 hex for that Emoji.

(Note: UTF-8 outside MySQL is equivalent to utf8mb4 inside MySQL.)
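
Not part of the original answer, but as a quick sanity check in node.js you can confirm that byte sequence using only the built-in Buffer:

// F0 9F 98 8A, read as UTF-8, is the Emoji.
const bytes = Buffer.from('f09f988a', 'hex');
console.log(bytes.toString('utf8')); // 😊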

In MySQL, you must have the column/table declared with CHARACTER SET utf8mb4. You must also state that the data being stored/fetched is encoded utf8mb4. Note: utf8 will not suffice.

Do a SELECT HEX(col) FROM ... to see if you get that hex for that Emoji. If that is the case and the column is currently latin1, then part of the fix is to carefully convert the column to utf8mb4. That is, you have CHARACTER SET latin1 but UTF-8 bytes in it; the 2-step ALTER below leaves the bytes alone while fixing the charset. Assuming the column is already VARCHAR(111) CHARACTER SET latin1 NOT NULL, do this 2-step ALTER:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Virtually any other conversion mechanism will make a worse mess.

As for establishing the connection correctly, it goes something like this for node.js:

var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
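
A slightly fuller sketch (hypothetical host, credentials, table, and column names) using the mysql package:

var mysql = require('mysql');

var connection = mysql.createConnection({
  host: 'localhost',
  user: 'app',
  password: 'secret',
  database: 'app',
  charset: 'utf8mb4'   // the connection charset must match what is stored
});

connection.query('SELECT col FROM tbl LIMIT 1', function (err, rows) {
  if (err) throw err;
  console.log(rows[0].col); // should print 😊 once the column really is utf8mb4
  connection.end();
});
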
Rick James
  • If the column is a CHAR (vs VARCHAR) does any of this change? I'd imagine the answer is no, but when I convert all the Mojibake bytes turn into literal question marks `?` – randombits Apr 03 '17 at 14:36
  • See the link again -- question marks is caused by something else. Check the "best practice" list. `CHAR` tends to waste space unless the strings are fixed length. – Rick James Apr 03 '17 at 15:27
3

You do not need to, and should not, convert the encoding. Just use the right protocols. If you send the HTML page in UTF-8, the browser will send the data back to your server in UTF-8.

Then you want to store the data in your database, which is in latin1; that won't work at all. You must convert your database to UTF-8 as well. That includes the database, the tables, and eventually the columns themselves. Also make sure your database client is configured to connect in UTF-8, because the client itself has to declare its encoding.

Once you have the whole data-flux in UTF-8, everything will work flawlessly.

Server -> GET HTML -> POST -> Server -> SQL Client -> Database -> Table -> Column
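
For the first link in that chain, a minimal sketch (plain node http, no framework assumed) of sending the page with an explicit UTF-8 charset:

const http = require('http');

http.createServer((req, res) => {
  // Declaring charset=utf-8 makes the browser submit form data back as UTF-8.
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
  res.end('<form method="POST"><input name="msg"><button>Send</button></form>');
}).listen(3000);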

Guillaume F.
  • If you read the comments in the original post, you'll see that changing the database to UTF-8 is not an option at this time. – randombits Apr 01 '17 at 02:30
  • I see that, but it will have to be an option someday because using dirty tricks to convert encoding is only a gateway to hell opening. OP's team should talk about it urgently and take a decision accordingly. – Guillaume F. Apr 01 '17 at 02:33
3

It is recommended to use iconv (a simple ISO-8859-1 to UTF-8 conversion).

From this gist

var iconv = require('iconv');

function toUTF8(body) {
  // convert from iso-8859-1 to utf-8
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  return buf.toString('utf-8');
}

Here, if you pass anything in ISO-8859-1, it will return its UTF-8 equivalent.

For example,

toUTF8("ðŸ˜Š");

will return 😊.

Sagar V
2

I have found a super dirty way to convert it back:

    // Map each mojibake character back to the latin1 byte it came from.
    const isoToUtfTable = {
      'ð': 0xf0,
      'Ÿ': 0x9f,
      '˜': 0x98,
      'Š': 0x8a
    };

    function convertISO8859ToUtf8(s) {
      // Rebuild the raw byte sequence, then turn it back into a binary string.
      const buf = new Uint8Array([...s].map(c => isoToUtfTable[c]));
      return String.fromCharCode(...buf);
    }

    function decode_utf8(s) {
      // Interpret that binary string as UTF-8 (escape() is deprecated, but works here).
      return decodeURIComponent(escape(s));
    }

    console.log(decode_utf8(convertISO8859ToUtf8('ðŸ˜Š')))

Now you simply need to complete the isoToUtfTable table (it's small, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
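
Not part of the original answer, but as an alternative sketch: since MySQL's latin1 is effectively Windows-1252, the iconv-lite library mentioned in the comments can do the same re-encoding without a hand-written table, assuming the string really contains those Windows-1252 characters:

    const iconv = require('iconv-lite');

    // Re-encode the mojibake string as Windows-1252 bytes, then decode those bytes as UTF-8.
    function fixMojibake(s) {
      return iconv.decode(iconv.encode(s, 'win1252'), 'utf8');
    }

    console.log(fixMojibake('ðŸ˜Š')); // 😊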

zerkms
  • There has to be a better way than to manually fill a lookup table. If this is the best answer I'll mark it as such, but I'm going to wait to see if there's a more reliable method first. – randombits Mar 30 '17 at 21:06
  • @randombits you may take https://github.com/ashtuchkin/iconv-lite or any other library to convert them. This is simply a way to do that without bringing dependencies. – zerkms Mar 30 '17 at 21:17
  • I've been trying to get it to work with iconv-lite, I don't mind adding a dependency - can't seem to get it back to UTF-8, though. Just displays the string literals that are in the database. – randombits Mar 30 '17 at 21:20
  • would it be possible to update your answer to also show how this is done using iconv-lite? – randombits Mar 30 '17 at 21:35
  • @randombits I tried using their https://github.com/ashtuchkin/iconv-lite/blob/master/encodings/sbcs-data-generated.js but apparently the last 3 of the 4 characters in `ðŸ˜Š` are not valid ISO-8859-1 characters. So I don't know any other way of fixing it than just creating a tiny 255-element table (maybe even 127 if the lower half is the same). – zerkms Mar 30 '17 at 21:54
1

Maybe take a look at node-iconv.

const { Iconv } = require('iconv');

// Convert from ISO-8859-1 (latin1) to UTF-8; `something` is your input string or buffer.
const iconv = new Iconv('ISO-8859-1', 'UTF-8');
const buffer = iconv.convert(something);
console.log(buffer);
console.log(buffer.toString('utf8'));
Samuel Tulach
1

This is a full answer based on @zerkms's solution.

// Placeholder keys ('na', 'NSBP', 'SHY') stand in for control or invisible characters;
// they never match a real input character, so the repeated 'na' keys are harmless.
const isoToUtfTable = {
'€':0x80,
'na':0x81,
'‚':0x82,
'ƒ':0x83,
'„':0x84,
'…':0x85,
'†':0x86,
'‡':0x87,
'ˆ':0x88,
'‰':0x89,
'Š':0x8a,
'‹':0x8b,
'Œ':0x8c,
'na':0x8d,
'Ž':0x8e,
'na':0x8f,
'na':0x90,
'‘':0x91,
'’':0x92,
'“':0x93,
'”':0x94,
'•':0x95,
'–':0x96,
'—':0x97,
'˜':0x98,
'™':0x99,
'š':0x9a,
'›':0x9b,
'œ':0x9c,
'na':0x9d,
'ž':0x9e,
'Ÿ':0x9f,
'NSBP':0xa0,
'¡':0xa1,
'¢':0xa2,
'£':0xa3,
'¤':0xa4,
'¥':0xa5,
'¦':0xa6,
'§':0xa7,
'¨':0xa8,
'©':0xa9,
'ª':0xaa,
'«':0xab,
'¬':0xac,
'SHY':0xad,
'®':0xae,
'¯':0xaf,
'°':0xb0,
'±':0xb1,
'²':0xb2,
'³':0xb3,
'´':0xb4,
'µ':0xb5,
'¶':0xb6,
'·':0xb7,
'¸':0xb8,
'¹':0xb9,
'º':0xba,
'»':0xbb,
'¼':0xbc,
'½':0xbd,
'¾':0xbe,
'¿':0xbf,
'À':0xc0,
'Á':0xc1,
'Â':0xc2,
'Ã':0xc3,
'Ä':0xc4,
'Å':0xc5,
'Æ':0xc6,
'Ç':0xc7,
'È':0xc8,
'É':0xc9,
'Ê':0xca,
'Ë':0xcb,
'Ì':0xcc,
'Í':0xcd,
'Î':0xce,
'Ï':0xcf,
'Ð':0xd0,
'Ñ':0xd1,
'Ò':0xd2,
'Ó':0xd3,
'Ô':0xd4,
'Õ':0xd5,
'Ö':0xd6,
'×':0xd7,
'Ø':0xd8,
'Ù':0xd9,
'Ú':0xda,
'Û':0xdb,
'Ü':0xdc,
'Ý':0xdd,
'Þ':0xde,
'ß':0xdf,
'à':0xe0,
'á':0xe1,
'â':0xe2,
'ã':0xe3,
'ä':0xe4,
'å':0xe5,
'æ':0xe6,
'ç':0xe7,
'è':0xe8,
'é':0xe9,
'ê':0xea,
'ë':0xeb,
'ì':0xec,
'í':0xed,
'î':0xee,
'ï':0xef,
'ð':0xf0,
'ñ':0xf1,
'ò':0xf2,
'ó':0xf3,
'ô':0xf4,
'õ':0xf5,
'ö':0xf6,
'÷':0xf7,
'ø':0xf8,
'ù':0xf9,
'ú':0xfa,
'û':0xfb,
'ü':0xfc,
'ý':0xfd,
'þ':0xfe,
'ÿ':0xff }





let offsetArray = [];

// Map the string through the lookup table, recording the index where each
// 4-character mojibake sequence starts (used later to splice the emojis back in).
function convertISO8859ToUtf8Simple(s) {
  offsetArray = [];
  const buf = new Uint8Array([...s].map((c, index) => {
    if (isoToUtfTable[c]) {
      if (offsetArray.length > 0 && offsetArray[offsetArray.length - 1] + 3 < index) {
        offsetArray.push(index);
      }
      if (offsetArray.length == 0) {
        offsetArray.push(index);
      }
    }
    return isoToUtfTable[c];
  }));
  return String.fromCharCode(...buf);
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}


function emojiStringToArray(str) {
  // Split on surrogate pairs so each emoji becomes its own array element.
  const split = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
  const arr = [];
  for (let i = 0; i < split.length; i++) {
    const char = split[i];
    if (char !== "" && !char.includes('\u0000')) {
      arr.push(char);
    }
  }
  return arr;
};


const string = 'hello ðŸ˜ŒðŸ˜Œ with some emojis ðŸ˜Š ';

// Replace each 4-character mojibake sequence in the original string with the
// corresponding decoded emoji, shifting positions as the string gets shorter.
function finalString(s) {
  const emojis = emojiStringToArray(decode_utf8(convertISO8859ToUtf8Simple(s)));

  for (let i = 0; i < offsetArray.length; i++) {
    let position = 0;
    if (i == 0) {
      position = offsetArray[i];
    } else {
      // Each earlier replacement swapped 4 characters for a 2-character emoji.
      position = (i * -3) + offsetArray[i] + (i);
    }
    s = [s.slice(0, position), emojis[i], s.slice(position + 4)].join('');
  }
  return s;
}

console.log(finalString(string));
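
If every mojibake sequence maps cleanly through the table, this should log `hello 😌😌 with some emojis 😊`.
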
Mo Hawash