
In plain C, the standard defines three distinct "character" types:

  • plain char, whose signedness is implementation-defined.
  • signed char.
  • unsigned char.

Let's assume at least C99, where stdint.h is available (so you have int8_t and uint8_t as explicit-width alternatives to signed and unsigned char).

For now it seems to me that the plain char type is only really useful (or necessary) when interfacing with standard library functions such as printf, and is better avoided in all other scenarios. Using char can lead to undefined behavior when the implementation makes it signed and you need to do any arithmetic on such data for whatever reason.

The problem of choosing an appropriate type is probably most apparent when dealing, for example, with Unicode text (or any code page using values above 127 to represent characters), which could otherwise be handled as a plain C string. However, the relevant string.h functions all accept char, and if such data is typed as char, that causes problems when trying to interpret it, for example in a display routine capable of handling its encoding.
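To illustrate the issue (a minimal sketch of my own; the helper name is hypothetical): a check against byte values above 127 misbehaves when char is signed, unless the value is first converted to unsigned char.

```c
#include <stdio.h>

/* Hypothetical helper: is this byte the lead byte of a multi-byte UTF-8
 * sequence? With a signed char, bytes above 0x7F become negative, so
 * comparing the char value directly against 0xC0 would never match.
 * Converting to unsigned char first works on any implementation. */
static int is_utf8_lead_byte(char c)
{
    unsigned char u = (unsigned char)c;
    return u >= 0xC0u && u <= 0xF4u;
}

int main(void)
{
    const char *s = "caf\xC3\xA9";           /* "café" encoded as UTF-8 */
    printf("%d\n", is_utf8_lead_byte(s[3])); /* prints 1 */
    return 0;
}
```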

What is the recommended approach in such a case? Are there any particular reasons, beyond this, why it could be preferable to use char over the appropriate fixed-width types from stdint.h?

Jubatian
  • This question is a little open-ended and subjective. I'll add my thoughts anyway. Personally I use `char` for *characters*, and array of (or pointer to) `char` for *strings* of characters. That includes UTF-8 encoded characters. If a small generic byte-sized integer is needed I would use `int8_t` or `uint8_t`. – Some programmer dude Jan 04 '18 at 08:28
  • @Someprogrammerdude I see this (open-ended), but I also see the `char` type occurring in various codebases in places where it seems likely to trigger undefined behavior and other nasties. It would be nice if at least a few "where definitely not" points were made, or how to work with it once you decide on using it (such as if you needed to write some print routine yourself; of course, as long as you are just tossing an array around between third-party API functions it doesn't matter much). – Jubatian Jan 04 '18 at 08:37

2 Answers


The char type is for characters and strings. It is the type expected and returned by all the string handling functions. (*) You really should never have to do arithmetic on char, especially not the kind where signedness would make a difference.

unsigned char is the type to be used for raw data. For example, memcpy() or fread() interpret their void * arguments as arrays of unsigned char. The standard guarantees that any type can also be represented as an array of unsigned char. Any other conversion might be "signalling", i.e. triggering exceptions (ISO/IEC 9899:2011, section 6.2.6 "Representation of Types"). (**)
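For illustration (a minimal sketch of my own, not from the standard): the object representation of any value can be examined byte by byte through an unsigned char pointer.

```c
#include <stdio.h>

int main(void)
{
    double d = 1.0;

    /* Every object can be viewed as an array of unsigned char, so
     * inspecting the representation of d this way is well defined. */
    const unsigned char *p = (const unsigned char *)&d;
    for (size_t i = 0; i < sizeof d; i++)
        printf("%02x ", (unsigned)p[i]);
    printf("\n");
    return 0;
}
```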

signed char is for when you need a signed integer of char size (for arithmetic).


(*): The character handling functions in <ctype.h> are a bit oddball about this, as they cater for EOF (negative), and hence "force" the character values into the unsigned char range (ISO/IEC 9899:2011, section 7.4 Character handling). But since it is guaranteed that a char can be cast to unsigned char and back without loss of information as per section 6.2.6... you get the idea.
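In practice that means casting to unsigned char before calling the <ctype.h> functions; a minimal sketch of my own:

```c
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *s = "caf\xC3\xA9";  /* may contain bytes above 127 */

    for (size_t i = 0; s[i] != '\0'; i++) {
        /* The argument to toupper() must be representable as an unsigned
         * char or be EOF; passing a negative char directly would be
         * undefined behaviour, hence the cast. */
        putchar(toupper((unsigned char)s[i]));
    }
    putchar('\n');
    return 0;
}
```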

When the signedness of char would make a difference -- in the comparison functions like strcmp() -- the standard dictates that char is interpreted as unsigned char (ISO/IEC 9899:2011, section 7.24.4 Comparison functions).
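An illustrative strcmp()-style loop (my own sketch, not taken from any particular implementation) therefore looks like this:

```c
/* Compares as unsigned char, so '\200' is greater than '\0'
 * regardless of whether plain char is signed. */
int my_strcmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```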


(**): Practically, it is hard to see how a conversion of raw data to char and back could be signalling where the same done with unsigned char would not be signalling. But unsigned char is what the section of the standard says. ;-)

DevSolar
  • I see that normally you don't need to do arithmetic on the characters of strings, as you usually have a sufficient environment (API) to do the low-level job. But what if you need to write parts of that API, too? (For example, some special display or printout routine.) Eh, probably I should ask another question on this. So if you need strings, then use `char`. If you need to write a low-level API for it, then ??? (other question, not relevant here). And obviously that API should then be designed to take `char` data. – Jubatian Jan 04 '18 at 08:46
  • @Jubatian: As a library implementor, you have to "agree" with the compiler on the signedness of char; this needs to be configured of course. But you really, really should not need to do arithmetic on `char`. I cannot really imagine a use-case for that, and I *have* implemented a (partial) C standard library. Even stuff like `isupper()` etc. is usually done via a cast to `unsigned char` and using that as an index into a locale-defined lookup table. As for "other APIs", well, it's those APIs that decide which types they demand, but it should be `char` for strings and `unsigned char` for data. – DevSolar Jan 04 '18 at 09:00
  • An example where you definitely need to do arithmetic is decoding UTF-8. Another one could be an interpreter; I often had to code such things for microcontrollers accepting some plaintext-based protocol, accessible through a serial terminal. I don't think you could always work around everything there with the standard C library and tables (and the latter might not even be viable due to the ROM space cost!). – Jubatian Jan 04 '18 at 09:06
  • @Jubatian: Ahahaha... my words come back to bite me, don't they? Multibyte support is one of the parts of the lib I never got around to implementing. ;-) The point here is that you are always guaranteed you can cast a `char` to `unsigned char` and back without loss of information or triggering exceptions. So, if you need to do arithmetic, handle `char` as `unsigned char`, and do your thing (as e.g. strcmp() does when it comes to "smaller / larger"); see the sketch after these comments. – DevSolar Jan 04 '18 at 09:15
  • I feel a bit reluctant to use `unsigned char`; the MISRA (2012) coding standard, for one, does not recommend using the basic types for a reason. `uint8_t` may, however, not be equivalent. It cannot be wider, though (as the C standard requires char to be at least 8 bits), but it might be narrower on a DSP or some other weird thing (if `uint8_t` even existed there). Maybe `unsigned char` really is the most appropriate there (MISRA or no MISRA), with a lot of care taken for the possibility that it could be wider. These are the times when I feel safer doing it in assembler. – Jubatian Jan 04 '18 at 13:32
  • @Jubatian: Actually the exact-width types (like `uint8_t`, `int16_t` etc.) aren't even guaranteed to exist on every given platform, only the "fast" and "least" types are... so `unsigned char` is really the only "fully conforming" way to do it. – DevSolar Jan 04 '18 at 13:45
  • @DevSolar: On every implementation where `uint8_t` exists, it's guaranteed to be the same size as `unsigned char`. The real problem with it is that the Standard would allow an implementation to treat `uint8_t` as an extended integer type with a size of 1 that isn't alias-compatible with `unsigned char`. Having such a type would be useful for performance in some cases, and if compatibility with other code weren't a consideration, `uint8_t` would be a logical name for it. – supercat Jan 25 '18 at 21:09
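Regarding the UTF-8 decoding mentioned in the comments above, a minimal sketch (my own, following the cast-to-unsigned-char approach) of classifying a lead byte:

```c
/* Sketch: length of a UTF-8 sequence from its lead byte. The char is
 * converted to unsigned char first, so the masking and comparisons work
 * the same whether plain char is signed or not. Validation of the
 * continuation bytes is omitted for brevity. */
int utf8_seq_length(char lead)
{
    unsigned char u = (unsigned char)lead;
    if (u < 0x80u)            return 1;  /* ASCII                       */
    if ((u & 0xE0u) == 0xC0u) return 2;  /* 110xxxxx                    */
    if ((u & 0xF0u) == 0xE0u) return 3;  /* 1110xxxx                    */
    if ((u & 0xF8u) == 0xF0u) return 4;  /* 11110xxx                    */
    return -1;                           /* continuation byte / invalid */
}
```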

Use char to store characters (the standard defines the behaviour for basic execution character set elements only, roughly the 7-bit ASCII characters).

Use signed char or unsigned char to get the corresponding arithmetic (signed and unsigned arithmetic have different properties for integers; char is an integer type).

This doesn't mean that you can't do arithmetic with plain chars, as the standard states:

6.2.5 Types - 3. An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.

So if you only use basic execution character set elements, arithmetic on them is well defined.
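For instance (a small example of my own), digit-to-value conversion uses only basic execution character set members, whose values are guaranteed to be nonnegative and contiguous for '0' to '9', so this arithmetic is well defined:

```c
#include <stdio.h>

int main(void)
{
    char c = '7';
    /* '0'..'9' are guaranteed contiguous and nonnegative in a char,
     * so subtracting '0' yields the digit's numeric value. */
    int value = c - '0';
    printf("%d\n", value);  /* prints 7 */
    return 0;
}
```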

Jean-Baptiste Yunès
  • [Why do C++ streams use char instead of unsigned char?](https://stackoverflow.com/q/277655/995714) – phuclv Jan 04 '18 at 08:48
  • What about UTF-8 and encodings using the above-127 range? Those can otherwise still behave like ordinary C strings and could potentially be used with the `string.h` functions. – Jubatian Jan 04 '18 at 08:59
  • As you can convert values of these types as you want, there is no problem at all; if you want to see them as `signed`/`unsigned`, then cast appropriately. – Jean-Baptiste Yunès Jan 04 '18 at 09:01
  • You mean what happens with the `strcmp` functions? The manual says: `The comparison is done using unsigned characters, so that '\200' is greater than '\0'` – Jean-Baptiste Yunès Jan 04 '18 at 09:02
  • Not that, just that you are (also) saying that if it behaves like a C string, then use C strings of chars to handle it. OK, understood. – Jubatian Jan 04 '18 at 09:08