Can I store UTF8 in C-style char array

Question

Can I safely store a UTF-8 string in a zero-terminated char *?

I understand that strlen() will not return the number of characters, but storing, printing and transferring the char array seems to be safe.

score 3 · Accepted Answer · answered Jan 27 '20 at 12:43

Yes.

Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII).

As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
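To illustrate the strlen() point from the question, here is a minimal sketch, assuming the input is valid UTF-8. The function name utf8_count_code_points is made up for this example. It counts code points by skipping continuation bytes (those of the form 10xxxxxx), whereas strlen() counts bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated
   UTF-8 string. Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "h" "\xC3\xA9" "llo" (the word "héllo" with é encoded as two bytes), strlen() reports 6 bytes while this function reports 5 code points. Copying, printing and transmitting the array work on bytes and are unaffected.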

score 1 · Answer 2 · answered Jan 27 '20 at 12:52

In C, a 0 byte is the string terminator. UTF-8 produces a 0 byte only for the code point U+0000, so as long as U+0000 is not in the string, there is no problem.

To be able to store U+0000 in such a string, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte of which has its high bit set, i.e. is >= 128). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.

Formally, this is no longer UTF-8, because UTF-8 requires the shortest encoding of each code point. It also only works if the decoded (non-UTF-8) form tracks its length explicitly instead of relying on strlen.
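The U+0000 part of this scheme can be sketched in C as follows. The names mutf8_encode and mutf8_decode are hypothetical, and the sketch only rewrites 0 bytes as the two-byte sequence 0xC0 0x80. Java's modified UTF-8 also encodes supplementary characters differently, which is not handled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy `len` bytes of UTF-8 from `src` to `dst`,
   writing each 0x00 byte as 0xC0 0x80, then append a terminating NUL.
   `dst` must have room for 2 * len + 1 bytes.
   Returns the number of bytes written, excluding the terminator. */
size_t mutf8_encode(const char *src, size_t len, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] == '\0') {
            dst[out++] = (char)0xC0;
            dst[out++] = (char)0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return out;
}

/* Hypothetical inverse: turn each 0xC0 0x80 pair back into a 0x00 byte.
   The result may contain embedded zeros, so the length is returned
   explicitly and no terminator is appended. */
size_t mutf8_decode(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    while (*src != '\0') {
        if ((unsigned char)src[0] == 0xC0 && (unsigned char)src[1] == 0x80) {
            dst[out++] = '\0';
            src += 2;
        } else {
            dst[out++] = *src++;
        }
    }
    return out;
}
```

The encoded form is safe to pass to strlen() and other C string functions. The decoded form is not, which is why mutf8_decode returns the length rather than relying on a terminator.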

So the most feasible solution is not to accept U+0000 in strings.

However, if U+0000 is in the string, then the input string is not OK anyway. — Nick, Jan 27 '20 at 13:26
Yes, nobody uses U+0000, U+007F (delete) or U+0008 (backspace) for any practical purpose as _text_. — Joop Eggen, Jan 27 '20 at 13:38
