3

Recently, I came across a weird behaviour while testing a backend system which uses MySQL.

In my frontend, when I printed the length of a string, I got a different value than from the MySQL.

For the test, I'm using this 0123456789, and ‍♀️ emojis to check the length of the string.

Javascript(In the browser console):

> "‍♀️".length
<- 7
> "".length
<- 2
> "0123456789".length
<- 10

enter image description here

MySQL(Using MySQL Workbench):

CREATE TABLE testing_char_length 
  ( 
     id  INT auto_increment PRIMARY KEY, 
     txt VARCHAR(10) 
  ) 
DEFAULT charset utf8mb4; 

INSERT INTO testing_char_length 
            (txt) 
VALUES     ('0123456789'), 
            (''), 
            ('‍♀️'); 

SELECT *, 
       Char_length(txt) NumberOfChars, 
       Length(txt)      SizeInBytes 
FROM   testing_char_length; 

DROP TABLE testing_char_length; 

Output:

| id    | txt           | NumberOfChars     | SizeInBytes   |
|----   |------------   |---------------    |-------------  |
| 1     | 0123456789    | 10                | 10            |
| 2     |             | 1                 | 4             |
| 3     | ‍♀️            | 5                 | 17            |

enter image description here


Initially I thought it might be due to some encoding issues. So, tried to search about the encodings used by Javascript and MySQL.

Javascript uses UTF-16

So, I also tried setting the MySQL encoding to utf16 to be sure it is not the culprit and even after using default charset utf16, the output is same like in utf8mb4.

Why is there a difference in the number of characters in Javascript and MySQL?

I'm sure there is something which I haven't come across till now and hoping for some answers for this behaviour.

Tested on:

Google Chrome Version 81.0.4044.122 (Official Build) (64-bit)

MySQL Workbench: 8.0.19

MySQL: 8.0.19

OS: macOS Catalina version 10.15.3


Update: 1

The HEX codes of string "0123456789" are same in both JS and MySQL. Other two are not same.

JS Fiddle for Hex values. Code to calculate HEX is taken from here

JS Hex values

MySQL Fiddle for Hex values (Taken from the comment below)

MySQL Hex values

Even though the encoding is same, why are the HEX values are different? Isn't UTF encoding supposed to be universal and same?

SkrewEverything
  • 2,393
  • 1
  • 19
  • 50
  • Interesting. [mysql fiddle](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=328a5dc9abda9d70f38f63a31c724494) of this. bytes was always going to be implementation dependent, but I was hoping chars would be universal. Can you get a hex encoding in javascript of these characters? – danblack May 14 '20 at 23:04
  • @danblack I have updated the answer with the Hex encodings in js. – SkrewEverything May 14 '20 at 23:42
  • I just tested your fiddle. has extra characters next to it in the fiddle that are not in your database. – StackSlave May 15 '20 at 00:32
  • [JS Fiddle for Hex values](https://jsfiddle.net/g3cex120/1/) is not correct. – StackSlave May 15 '20 at 00:39
  • @StackSlave I just checked. I think it’s a bug in the JS fiddle editor. It is showing like it has a space after the emoji but when I try to delete it, the emoji is being deleted. And also in the alert, it is showing fine. I have also copied the emoji from the JS fiddle along with the quotes to test whether it has a space or not and there was no space when I pasted that into the chrome console and some other text fields to double check. – SkrewEverything May 15 '20 at 01:48
  • Great. Glad you figured it out! – StackSlave May 15 '20 at 02:04
  • @StackSlave No. The problem is still there. I was just giving an explanation for your comment about inputs being different. The hex values are still not same even though the emojis are same. – SkrewEverything May 15 '20 at 02:08

1 Answers1

0

As you found out, your app language is using UTF-16, while MySQL is using its utf8mb4 (aka UTF-8).

UTF-8, UTF-16, and UTF-32 have different hex representations, potentially a different number of bytes and "characters", but can be transformed from one to another.

In the "BMP", 16-bit characters can be represented in 3 bytes in utf8 or 2 bytes in utf16. Beyond that, utf8 goes to 4 bytes (1 character) and utf16 goes to 4 bytes (2 characters). That 2nd character is saying "panic, we exceeded 16 bits, we need something to extend the character set".

Potentially, utf8 could go to 5 bytes, but no such characters have been standardized on.

Rick James
  • 135,179
  • 13
  • 127
  • 222