A practical algorithm depends largely on how the data store's UTF-8 implementation sanitizes input strings:
- Are "overlong" byte sequences allowed?
- Are surrogates allowed?
- Are code points limited to the Unicode maximum of 0x10FFFF?
- Are all ASCII control chars allowed?
- Are any other Unicode characters disallowed?
Assuming only a check against the 0x10FFFF maximum, you get the following results for UTF-8 byte sequences of each length:
| UTF-8 sequence length | Max code point | Bits/code point | Bits/byte |
|---|---|---|---|
| 1 byte  | 0x7F     | 7   | 7     |
| 2 bytes | 0x7FF    | 11  | 5.5   |
| 3 bytes | 0xFFFF   | 16  | ~5.33 |
| 4 bytes | 0x10FFFF | ~20 | ~5    |
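A quick way to reproduce the bits-per-byte column (a sketch using Python's built-in UTF-8 encoder, which rejects surrogates but otherwise matches the assumption above):

```python
# Bits of payload per UTF-8 byte, for the longest code point of each sequence length
for max_cp, bits in [(0x7F, 7), (0x7FF, 11), (0xFFFF, 16), (0x10FFFF, 20)]:
    nbytes = len(chr(max_cp).encode("utf-8"))
    print(f"{nbytes}-byte sequence: {bits:2d} bits/code point, {bits / nbytes:.2f} bits/byte")
```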
If the data store limits the number of *bytes* stored, you'll obviously want to store the data as ASCII, since 1-byte sequences give the best ratio of payload bits per stored byte.
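For instance, a dense ASCII encoding such as Base85 (just one common choice, not something the data store mandates) gets you about 6.4 bits of payload per stored byte:

```python
import base64

payload = bytes(range(16))            # 16 bytes of arbitrary binary data
encoded = base64.b85encode(payload)   # ASCII-only, 5 characters per 4 payload bytes
print(len(encoded))                   # 20 stored bytes for 16 payload bytes
assert base64.b85decode(encoded) == payload
```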
The more interesting case is a data store that limits the number of Unicode "characters" (code points, actually). Here it's best to use 4-byte UTF-8 sequences. Many data stores accept all code points from 0x10000 to 0x10FFFF, which lets you store 20 bits (2.5 bytes) of binary data per code point.
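A minimal sketch of that idea, assuming the store really does accept every code point from U+10000 upward; the 20-bit grouping and the padding handling here are my own, so treat it as an illustration rather than a finished codec:

```python
def encode20(data: bytes) -> str:
    """Pack the input bits into code points U+10000..U+10FFFF, 20 bits each."""
    bits = int.from_bytes(data, "big")
    nbits = len(data) * 8
    pad = (-nbits) % 20          # pad up to a multiple of 20 bits
    bits <<= pad
    nbits += pad
    return "".join(chr(0x10000 + ((bits >> shift) & 0xFFFFF))
                   for shift in range(nbits - 20, -1, -20))

def decode20(text: str, nbytes: int) -> bytes:
    """Reverse encode20; the original byte length must be stored separately."""
    bits = 0
    for ch in text:
        bits = (bits << 20) | (ord(ch) - 0x10000)
    bits >>= len(text) * 20 - nbytes * 8   # drop the padding bits
    return bits.to_bytes(nbytes, "big")

data = b"hello world"
s = encode20(data)
print(len(data), "bytes ->", len(s), "code points")   # 11 bytes -> 5 code points
assert decode20(s, len(data)) == data
```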
If the number of available code points is not a power of two, you'll essentially have to convert your input into a base-n number (with n ≈ 1,000,000) for an optimal encoding.
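For that general case, you can treat the input as one big integer and rewrite it in base n. The alphabet below is hypothetical, simply the first n code points starting at U+10000; any n code points the store accepts would work the same way:

```python
def encode_base_n(data: bytes, n: int, base_cp: int = 0x10000) -> str:
    """Treat the input as one big integer and rewrite it in base n."""
    num = int.from_bytes(b"\x01" + data, "big")   # 0x01 sentinel preserves leading zero bytes
    digits = []
    while num:
        num, d = divmod(num, n)
        digits.append(chr(base_cp + d))
    return "".join(reversed(digits))

def decode_base_n(text: str, n: int, base_cp: int = 0x10000) -> bytes:
    num = 0
    for ch in text:
        num = num * n + (ord(ch) - base_cp)
    raw = num.to_bytes((num.bit_length() + 7) // 8, "big")
    return raw[1:]                                # strip the 0x01 sentinel

data = b"\x00\x01binary\xff"
n = 1_000_000                                     # ~19.9 bits per code point
assert decode_base_n(encode_base_n(data, n), n) == data
```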