
This question:

What is an unsigned char?

does a great job of discussing char vs. unsigned char vs. signed char in C.

However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?

I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.

Craig S. Anderson
  • Yes, the accepted answer is correct in that regard. And the C string-type is a sequence of non-NUL bytes terminated by a NUL-byte, of unspecified encoding, though use UTF-8 if you have any choice at all. – Deduplicator Oct 24 '14 at 03:41
  • Except for UTF-8, you don't want to use strlen because there can be internal NULs. Anyway, this is a huge topic. For Unicode I suggest http://site.icu-project.org/ – Jim Balter Oct 24 '14 at 03:41
  • @Deduplicator Please read more carefully: **Except for** UTF-8, ... there can be internal NULs. And what I meant is what the OP referred to: Big5 etc. Of course UTF-16 and UTF-32 can also contain internal NULs, so yes, what I wrote applies to them *too*. – Jim Balter Oct 24 '14 at 03:44
  • @JimBalter: Do you know an 8bit-encoding with internal `NUL`s which are not that terminator? – Deduplicator Oct 24 '14 at 03:46
  • An advantage of `unsigned char` over `char` (when `char` is signed) is with the `is...()` functions, which are UB for `signed char` as they expect an `unsigned char` value or EOF. – chux - Reinstate Monica Oct 24 '14 at 04:19
  • @PetrAbdulin NUL is an ASCII character 0. Since UTF-16 is non-ASCII it cannot have NUL by definition. It can't have zero characters either. (It can have subsets of 8 zero bits in a row but that doesn't matter) – M.M Oct 24 '14 at 05:42
  • Craig, if the data is mixed of text and non-text you have to pick whichever option ends up in less casting :) – M.M Oct 24 '14 at 05:45
  • @JimBalter my bad, misread you, sorry. – Petr Abdulin Oct 24 '14 at 10:43
  • @Matt McNabb Would you also say "Since EBCDIC is non-ASCII it cannot have NUL by definition"? Hopefully not since it's patently absurd. By NUL here, we mean '\0', which isn't restricted to ASCII -- nor is the standard C language, and therefore strlen etc. The warning about internal NULs/zero bytes in UTF-16, UTF-32, and other string encodings is legitimate because it may not be obvious to people who don't know everything there is to know about string encodings. A much longer discussion would cover wcslen etc. (and why those, too, can't be used with Unicode encodings.) – Jim Balter Oct 24 '14 at 21:33
  • Chux - I see that _isalpha()_ and its cousins expect an _int_ (FreeBSD 8.4 man page, MKS web page, others) not an _unsigned int_. So...could you clarify? What does **UB** mean? – Craig S. Anderson Oct 24 '14 at 22:23
  • @chux: The `is*()` functions do not have undefined behavior for `signed char` arguments. They have undefined behavior if the argument (which is of type `int`) is not within the range of `unsigned char` or equal to `EOF`. `isalpha((char)'A')` is well defined; `isalpha((char)-42)` has undefined behavior. – Keith Thompson Aug 18 '15 at 16:21
  • OP, Note: Suggest using @ + "user" to ensure timely notification, as in "@chux" vs "Chux". Hopefully @Keith Thompson's cleaner explanation answers your comment. -- Unclear why "not an _unsigned int_." was mentioned. – chux - Reinstate Monica Aug 18 '15 at 16:44

3 Answers


Use plain char to represent characters. Use signed char when you want a signed integer type that covers values from at least -127 to +127. Use unsigned char for an unsigned integer type with a range of values from 0 to 255.

Dr. Debasish Jana
  • Technically `uint8_t` should be used for the latter (which will give a compiler error if you're on a platform that doesn't support 8-bit chars) – M.M Oct 24 '14 at 05:43
  • This doesn't say anything more (and quite a bit less) than the link that the OP provided. i.e., it isn't an answer to the question asked and doesn't even reflect that there was such a question. It's stunning that it got two upvotes. – Jim Balter Oct 24 '14 at 21:39

The question you are asking is probably much broader than you expect.

To answer it directly: most implementations use the "byte" as the underlying buffer unit, and in those terms the standard uint8_t typedef is your best bet. That is primarily because most character sets use a variable number of bytes to store characters, so byte-by-byte processing is essential in the encoding and decoding process. It also simplifies conversion between different endiannesses.

In general it's incorrect to use strlen on anything other than ASCII or other single-byte encodings (the 0-255 range). It's certainly incorrect on any multi-byte encoding like Big5, UTF-8/16, or Shift-JIS.

Petr Abdulin
  • How is `strlen()` any less safe on UTF-8 vs. ASCII? Both have a code 0. C simply uses code 0 (ASCII NUL) as a terminating character, thus disallowing a C string with ASCII NUL characters. Code could equally use Unicode 0 (encoded in UTF-8 as a single 0 byte) as a terminating character. Not that I favor 0 terminated strings, but I do not see a great concern using `strlen()` on a string-like group of UTF-8 encoded Unicode characters. – chux - Reinstate Monica Oct 24 '14 at 04:51
  • @chux UTF-8 is not a single byte character encoding. It will report an incorrect number of characters. Maybe "safe" was not the best word. – Petr Abdulin Oct 24 '14 at 04:59
  • True that `strlen()` will not report the number of Unicode characters, but it will safely report the correct number of `char` used in the UTF-8 encoding, assuming a 0 termination. – chux - Reinstate Monica Oct 24 '14 at 05:02
  • @chux you are correct. My main point was that using `strlen` on mbcs is a bad idea in general. Usually we expect that `strlen` returns the number of characters, not the number of bytes used. – Petr Abdulin Oct 24 '14 at 05:16
  • It would be better to talk here about strcpy, which *can* be used on multi-byte encodings as long as they can't have internal '\0'. – Jim Balter Oct 24 '14 at 21:38

For UTF-8, or any encoding in which the ASCII characters keep the same code points, char is the best type for multi-byte character strings:

assume `typedef char utf8;`:

This is the only way to allow char * to be used as utf8 * without an explicit cast. This is extremely common and a good enough reason to be better than unsigned char.

utf8 * could be accidentally passed to a function expecting a pointer to a sequence of ASCII characters, but this could also be exactly what you need, e.g. if you want to printf your utf8 string (which is a valid thing to do)

The main drawback is that since char's signedness is unknown, using arithmetic comparison operators like > is unsafe, and the only safe way to check whether a character is in the ASCII range is to test the high bit directly, e.g. ISASCII(c), defined as (((c) & (1 << 7)) == 0)

Yanis.F