0

I mean it for putting it in VARCHAR column having utf8_bin collation. The thing with base64 is that the resulting text is longer than original and second issue is if I correctly understand this question the mysql limits varchar on number of chars not on their real byte size and since utf-8 encoded text can have chars that spans 3 bytes that could limit possibility of hitting the column boundary.

The resulting utf-8 text should like like gibberish chinese or mojibake as in here: Twitter image encoding challenge

But my question is precisely about if there is a maybe some kind of 'standardish' algorithm so I would not have to reinvent the wheel.

rsk82
  • 28,217
  • 50
  • 150
  • 240
  • Surely you can just store binary data in a VARCHAR anyway? From: https://dev.mysql.com/doc/refman/5.0/en/char.html: "Before MySQL 5.0.3, if you need a data type for which trailing spaces are not removed, consider using a BLOB or TEXT type. Also, if you want to store binary values such as results from an encryption or compression function that might contain arbitrary byte values, use a BLOB column rather than a CHAR or VARCHAR column, to avoid potential problems with trailing space removal that would change data values" – Alastair McCormack Sep 23 '15 at 12:25
  • Related: http://stackoverflow.com/questions/646974/is-there-a-standard-technique-for-packing-binary-data-into-a-utf-16-string – dan04 Sep 23 '15 at 23:22

2 Answers2

2

A practical algorithm depends largely on how the UTF-8 implementation of a certain data store sanitizes input strings.

  • Are "overlong" byte sequences allowed?
  • Are surrogates allowed?
  • Are code points limited to the Unicode maximum of 0x10FFFF?
  • Are all ASCII control chars allowed?
  • Are any other Unicode characters disallowed?

Assuming only a check for the 0x10FFFF maximum, you get the following results for UTF-8 byte sequences of a certain length:

1-byte sequence

max code point: 0x7F
bits/code point: 7
bits/byte: 7

2-byte sequence

max code point: 0x7FF
bits/code point: 11
bits/byte: 5.5

3-byte sequence

max code point: 0xFFFF
bits/code point: 16
bits/byte: 5.33

4-byte sequence

max code point: 0x10FFFF
bits/code point: ~20
bits/byte: ~5

If the data store limits the number of bytes stored, you'll obviously want to store the data as ASCII to maximize the amount of binary input data.

The more interesting case is a data store that limits the number of Unicode "characters" (code points, actually). Here it's best to use 4-byte UTF-8 sequences. Many data stores accept all code points from 0x10000 to 0x10FFFF which allows to store 20 bits (2.5 bytes) of binary data per code point.

If the number of available code points is not a power of two, you'll essentially have to break up your input into a base-n number (with n ~ 1,000,000) for an optimal encoding.

nwellnhof
  • 32,319
  • 7
  • 89
  • 113
1

All valid 4-byte UTF-8 sequences (or shorter) can be stored in MySQL's CHARACTER SET utf8mb4.

CHAR(10) CHARACTER SET utf8 occupies 30 bytes (3 bytes per character).
CHAR(10) CHARACTER SET utf8mb4 occupies 40 bytes.
VARCHAR(NN) CHARACTER SET XX occupies 1 or 2 bytes for length plus as many bytes as needed for up to NN characters set XX.

Although the utf8 encoding standard (and Unicode) allows for more than 4 bytes, no such characters have been established yet. (And MySQL can't yet handle them.) So, I would argue that utf8mb4 handles "all available and printable legal characters from utf-8".

The "collation" refers to comparisons and sort order, not encoding or storage.

Mojibake, and other flavors of gibberish, occurs when the right hand does not know what the left hand has. That is, when there is a mismatch in CHARACTER SET during handoff between client and server.

Use of VARBINARY is sweeping the real problem under the rug.

Rick James
  • 135,179
  • 13
  • 127
  • 222