1

Follow-up to:
Can UTF-8 contain zero byte?

Can I safely store a UTF-8 string in a zero-terminated char *?

I understand strlen() will not return the number of characters (it counts bytes), but storing, printing and transferring the char array seems to be safe.

Nick

2 Answers

3

Yes.

Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII, and UTF-8 encodes it as a single 0 byte).

As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
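To illustrate the strlen() point from the question: the terminator still works for UTF-8, but strlen() reports bytes, not characters. A minimal sketch, assuming the input is valid UTF-8; the helper name utf8_count_code_points is made up for this example and is not a standard function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8
   string by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Every byte that is not a continuation byte starts a code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "na\xC3\xAFve" (the word "naïve"), strlen() gives 6 because ï takes two bytes, while this helper gives 5. No byte of a multi-byte sequence is ever 0, so the terminator is never hit early.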

unwind
1

In C, a 0 byte is the string terminator. As long as the Unicode code point 0, U+0000, is not in the Unicode string, there is no problem.

To be able to store U+0000 without a 0 byte, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte thereof having its high bit set, >= 128); U+0000 becomes the two bytes 0xC0 0x80. This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.

Formally it is no longer UTF-8, as UTF-8 requires the shortest encoding. Also, this only works if the length is tracked explicitly instead of with strlen once the data is unpacked to standard form, because the unpacked data contains a real 0 byte again.
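A rough sketch of the 0-to-two-byte step in C, assuming the source is standard UTF-8 held in a buffer with an explicit length. The function name encode_embedded_nul and its error convention are invented for this example. It only handles the embedded 0; Java's modified UTF-8 additionally encodes supplementary characters as surrogate pairs, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies len bytes from src to dst, writing every
   0x00 byte as the overlong pair 0xC0 0x80, then NUL-terminates dst.
   Returns the number of bytes written (excluding the terminator), or
   (size_t)-1 if dst_cap is too small. */
size_t encode_embedded_nul(const char *src, size_t len,
                           char *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        size_t need = (src[i] == '\0') ? 2 : 1;
        if (out + need >= dst_cap)      /* keep one byte for the terminator */
            return (size_t)-1;
        if (src[i] == '\0') {
            dst[out++] = (char)0xC0;
            dst[out++] = (char)0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    if (out >= dst_cap)                 /* covers len == 0 with dst_cap == 0 */
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

The result contains no 0 byte before the terminator, so it can be passed around as an ordinary C string.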
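The length issue can be sketched as follows, assuming the 0xC0 0x80 convention for an embedded U+0000. The name decode_embedded_nul is made up for this example. The decoder has to return the length, because the unpacked buffer may contain real 0 bytes and strlen() would stop at the first one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unpacks a NUL-terminated string in which U+0000
   was stored as 0xC0 0x80. Writes the unpacked bytes to dst and returns
   their count, or (size_t)-1 if dst_cap is too small. The output is NOT
   NUL-terminated; callers must use the returned length. */
size_t decode_embedded_nul(const char *src, char *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    while (*src != '\0') {
        if (out >= dst_cap)
            return (size_t)-1;
        /* src[1] is safe to read: src[0] is non-zero, so at worst
           src[1] is the terminator. */
        if ((unsigned char)src[0] == 0xC0 && (unsigned char)src[1] == 0x80) {
            dst[out++] = '\0';
            src += 2;
        } else {
            dst[out++] = *src++;
        }
    }
    return out;
}
```

Unpacking "a\xC0\x80" "b" yields the three bytes 'a', 0, 'b'; strlen() on that buffer would report 1.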

So the most feasible solution is not to accept U+0000 in strings.

Joop Eggen