
Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plaintext protocol; should I encode it with something like Base64?

einclude
  • UTF-8 uses 8 bits so you can't send it over ASCII (7-bit) plaintext. Base64 encoding would help. Not because of null bytes, though. – Tim Pietzcker Aug 02 '11 at 04:40

3 Answers


Yes, the zero byte in UTF8 is code point 0, NUL. There is no other Unicode code point that will be encoded in UTF8 with a zero byte anywhere within it.

The possible code points and their UTF8 encoding are:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves, while all multibyte sequences have a high bit of 1 in every one of their bytes.

You may need to be careful that your ASCII plaintext protocol doesn't mistreat bytes with the high bit set, since that is how every non-ASCII code point will be transmitted.
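To illustrate the table, here is a quick check (using Python's built-in `str.encode`, which produces UTF-8) confirming that only U+0000 ever yields a zero byte and that every byte of a multibyte sequence has its high bit set:

```python
# Sample code points from each range of the table above.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {encoded.hex(' ')}")
    assert 0 not in encoded                    # no zero byte outside U+0000
    if len(encoded) > 1:
        assert all(b & 0x80 for b in encoded)  # high bit set on every byte

print(0 in "\x00".encode("utf-8"))  # True: only NUL itself yields a zero byte
```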

paxdiablo
  • Pacerier, there is no such thing as _invalid_ UTF8. By definition, if it's not valid, it's not UTF8 :-) – paxdiablo Jan 26 '15 at 19:41
  • The definition of UTF-8 has been so overloaded by too many to mean "bytes to be interpreted as UTF-8" instead of the original "bytes according to UTF-8". – Pacerier Jan 30 '15 at 19:52
  • Pacerier, you raise a good point, and that may be the case, but then they're just _wrong._ As wrong as people who try to claim EBCDIC is ASCII, COBOL is C, or French is Swahili :-) I can see _no_ reasonable interpretation that would call something UTF8 if it wasn't actually valid according to the UTF8 rules. If it's not _valid_ UTF8, then it's just some sort of arbitrary bytestream. – paxdiablo Jan 31 '15 at 05:04
  • It's far more likely that a program will deal with a byte stream claiming to be UTF-8 that is (strictly) not than a Parisian will demand a *pain au chocolat* in Kenya. While both are possible, the former is something warranting consideration while writing code. – Eric J. Oct 26 '15 at 15:18
  • Though it's always good to keep in mind invalid binary that is supposed to be interpreted as UTF-8, else you're opening yourself up to attacks along the lines of PHP's [Poison NUL Byte](http://hakipedia.com/index.php/Poison_Null_Byte) attack. – Qix - MONICA WAS MISTREATED Nov 24 '15 at 12:11
  • @paxdiablo so with your hard-line approach, would you go as far as to say that e.g. a postgresql text field does not support UTF-8? Postgres fails if you try to insert a string with a null byte, but in their docs it seems to go without saying that they support UTF-8 text since there are tons of references to it. – danny Jun 22 '16 at 00:56
  • In the range: U+000800-U+00ffff - can a UTF-8 stream not potentially contain a zero byte for a part of the codepoint? For example the first codepoint in that range is 0x0800 and the latter byte would be NUL even though this is not the end of the string. – gardarh Nov 01 '16 at 10:30
  • @gardarh: no, the UTF-8 encoding of 0x0800 is not `08, 00`, it's `e0, a0, 80`, with no zero byte in sight. See http://www.fileformat.info/info/unicode/char/0800/index.htm for more details but it's basically the first value in my third range in the answer, with *all* bytes having the high bit set, hence no possibility of `00`. – paxdiablo Nov 01 '16 at 12:01
  • So it's safe to assume a NULL byte would mark the end of a UTF-8 stream? – bryc Sep 15 '17 at 13:09
  • @bryc, a *stream* is not a well-defined thing. You could certainly have one that's similar to a C-style string, terminated by a `NUL` character. Or you could have a length value followed by that many characters. Or maybe you have some textual coding where you can use the ASCII-equivalent `ETX` or `FS` characters to indicate the end. – paxdiablo Sep 15 '17 at 13:33
  • It's worth pointing out that there's a [Modified UTF-8](https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) that encodes U+0000 as the two-byte sequence `\xC0\x80`. With Modified UTF-8, the null byte never appears in encoded text and it's safe to use the null byte to signal the end of an encoded text stream. (However, I doubt UTF-8, modified or not, would survive going over an ASCII plaintext protocol as OP wants. That 8th bit is kind of important!) – Ted Hopp Dec 30 '19 at 00:51
  • Can we assume C-style strings are safe for storing UTF-8? I know `strlen()` will return an incorrect result, but storing and `printf()` seem to be OK. – Nick Jan 27 '20 at 12:18
  • @Nick, yes, you can store UTF-8 in C strings because of the fact that the only zero-octets allowed are for code point 0, `NUL`. The `printf` will work *if* whatever is receiving the octets can cope with UTF-8. A modified length calculator need only count the number of octets that start with the bit sequence `0` or `11` (< 128 or >= 192, the non-continuation octets). – paxdiablo May 01 '22 at 01:34
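The modified length calculator described in the last comment can be sketched like this (a hypothetical helper, shown in Python for brevity; it counts the non-continuation bytes):

```python
def utf8_codepoint_count(data: bytes) -> int:
    """Count code points by counting the bytes that are NOT
    continuation bytes (continuation bytes look like 10xxxxxx)."""
    return sum(1 for b in data if b & 0xC0 != 0x80)

s = "héllo".encode("utf-8")             # 6 bytes, but only 5 code points
print(len(s), utf8_codepoint_count(s))  # 6 5
```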

ASCII text is restricted to byte values between 0 and 127. UTF-8 text has no such restriction - text encoded with UTF-8 may have its high bit set. So it's not safe to send UTF-8 text over a channel that doesn't guarantee safe passage for that high bit.

If you're forced to deal with an ASCII-only channel, Base-64 is a reasonable (though not particularly space-efficient) choice. Are you sure you're limited to 7-bit data, though? That's somewhat unusual in this day and age.
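As a sketch of that approach using Python's standard `base64` module: the raw UTF-8 bytes may have the high bit set, but the Base-64 output is guaranteed to be 7-bit safe:

```python
import base64

text = "naïve café"                 # contains non-ASCII code points
raw = text.encode("utf-8")
assert any(b & 0x80 for b in raw)   # some bytes have the high bit set

wire = base64.b64encode(raw)        # safe for a 7-bit ASCII channel
assert all(b < 0x80 for b in wire)  # every output byte is plain ASCII

assert base64.b64decode(wire).decode("utf-8") == text  # round-trips cleanly
```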

Michael Petrotta
  • You can use base-128 to deal with binary data in a UTF-8/ASCII-only channel, because the lower 128 byte values are all single-byte codepoints, AFAIK. – Janus Troelsen Feb 18 '14 at 11:12

A UTF-8 encoded string can have most values from 0x00 to 0xff in any given byte position of its backing memory, although a few specific combinations are never allowed: the octet values C0, C1, and F5 to FF never appear. See http://en.wikipedia.org/wiki/UTF-8 for details.

If you are transporting across a channel such as an ASCII stream that does not support binary data, you will have to appropriately encode. Base64 is broadly supported and will certainly solve that problem, though it is not entirely efficient since it uses a 64 character space to encode data, whereas ASCII allows for a 128 character space.
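The space cost mentioned above is easy to measure with Python's `base64` module: every 3 input bytes become 4 output characters, roughly a one-third expansion:

```python
import base64

raw = bytes(range(256)) * 4    # 1024 bytes of arbitrary binary data
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))  # 1024 1368: roughly a 4/3 expansion
```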

There is a SourceForge project that provides base91 encoding, which is more space-efficient while still avoiding non-printable characters: http://base91.sourceforge.net/

Eric J.
  • I don't think your first sentence is correct. The sequence `11111110` could only occur in a seven-unit sequence, which I believe is not specified, and `11111111` can *never* appear as far as I know. (How would it? Perhaps in a hypothetical extension to more than seven code units?) – Kerrek SB Aug 02 '11 at 08:00
  • You can use base-128 on ASCII or UTF-8 channels, that's even more efficient: http://stackoverflow.com/a/3956975/309483 – Janus Troelsen Feb 18 '14 at 11:09
  • Your first sentence is not correct. According to page 2 of [RFC 3629](https://tools.ietf.org/html/rfc3629) (an internet standard published in 2003-11), "The octet values C0, C1, F5 to FF never appear." –  Oct 26 '15 at 08:28
  • @Rhymoid: Thanks, I was not aware of that. Any idea why? Updated my answer accordingly. – Eric J. Oct 26 '15 at 14:41
  • @EricJ. C0 and C1 are invalid because they are part of overlong UTF-8 sequences (forbidden for its security implications; for instance, the sequence `[C0 80]` would encode U+0000 if it were allowed), F5 to FD are invalid because they encode invalid codepoints (the highest valid codepoint is U+10FFFF, making all sequences at most 4 octets in length), and FE and FF were never allowed in UTF-8. –  Oct 26 '15 at 15:12