
In plain C, the `char` type is at least 8 bits wide and has implementation-defined signedness.

As the answers to the When to use the plain char type in C question suggest, this type should be used when you have data that is a string by intention and behaves like a normal C string, allowing you to use, for example, the string.h functions from the standard library.

However, there are scenarios where you need to do arithmetic on such values. An example is UTF-8 data, for which you may have to write some kind of processing or display routine yourself (no appropriate library solution being available on your target).

How can this situation be handled in the safest, most portable manner?
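
For concreteness, a routine of the kind in question might look like the following sketch (the `utf8_next` name is made up here, and no validation of malformed sequences is attempted). Each byte is converted to `unsigned char` before any masking or shifting, so the result does not depend on the implementation-defined signedness of plain `char`:

```c
#include <stddef.h>

/* Decode one UTF-8 code point starting at str, storing the length of
   the encoded sequence in *len.  Malformed input is not detected. */
unsigned long utf8_next(const char *str, size_t *len)
{
    unsigned char b0 = (unsigned char)str[0];

    if (b0 < 0x80u) {                       /* 1-byte (ASCII) */
        *len = 1;
        return b0;
    }
    if ((b0 & 0xE0u) == 0xC0u) {            /* 2-byte sequence */
        *len = 2;
        return ((unsigned long)(b0 & 0x1Fu) << 6)
             |  ((unsigned char)str[1] & 0x3Fu);
    }
    if ((b0 & 0xF0u) == 0xE0u) {            /* 3-byte sequence */
        *len = 3;
        return ((unsigned long)(b0 & 0x0Fu) << 12)
             | ((unsigned long)((unsigned char)str[1] & 0x3Fu) << 6)
             |  ((unsigned char)str[2] & 0x3Fu);
    }
    *len = 4;                               /* 4-byte sequence */
    return ((unsigned long)(b0 & 0x07u) << 18)
         | ((unsigned long)((unsigned char)str[1] & 0x3Fu) << 12)
         | ((unsigned long)((unsigned char)str[2] & 0x3Fu) << 6)
         |  ((unsigned char)str[3] & 0x3Fu);
}
```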

Jubatian
    Declare the signedness explicitly? `signed char` and `unsigned char` have fully specified behavior. You can still use plain `char` too, just cast to the appropriate signedness version of `char` before using it for arithmetic purposes (important: You need a two step cast, to the appropriate signedness of `char`, then to the appropriate sized type if necessary; `(unsigned)mychar` will misbehave when `char` happens to be signed and the high bit is set unless you do `(unsigned)(unsigned char)mychar`). – ShadowRanger Jan 04 '18 at 13:52
  • @ShadowRanger My current idea is roughly that: Within the routine needing to process such, typically having a pointer to char input, cast individual values to `unsigned char` or `signed char` as appropriate for any arithmetic. – Jubatian Jan 04 '18 at 13:54
  • When doing arithmetics the ["*Usual Arithmetic Conversions*"](http://port70.net/~nsz/c/c11/n1570.html#6.3.1.8) apply. – alk Jan 04 '18 at 13:59
  • No, if you stick to the practice of using type `char` only for character data, then you *never* need to perform arithmetic with values of that type. Characters are not numbers, so arithmetic with them is not well defined. That characters are represented as numbers in computer memory means that you technically *can* perform arithmetic on them, but in doing so you stop treating them as characters. You ought to convert to any other numeric type by an appropriate mechanism (maybe just an assignment or cast) if you want to perform arithmetic. – John Bollinger Jan 04 '18 at 14:09
  • @JohnBollinger: "*so arithmetic with them is not well defined*" I object. "*You ought to convert to any other numeric type*" Why? C does this implicitly. Please see my previous comment. – alk Jan 04 '18 at 14:14
  • Some standard library functions such as `strcmp` compare bytes as `unsigned char`, although the parameters are type `const char *`. I've always wondered if it is supposed to convert each `char` to `unsigned char` or whether it is allowed to alias the pointers to `const unsigned char *`. It shouldn't make much difference on systems where `char` is unsigned, or where `char` is signed and 2's complement, but might make a difference on systems where `char` is signed but not 2's complement. – Ian Abbott Jan 04 '18 at 14:18
  • @ShadowRanger: `signed char` and `unsigned char` do not truly have fully specified behavior because they cannot be used in expressions without being promoted to `int`, which does not have completely specified behavior. For example, the behavior of `unsigned char a=1, b=253; a-b << 25;` is not defined if `int` is 32 bits, because the values are promoted to `int`, the result of subtraction is negative, and left shifts of negative values are undefined when the value cannot be represented. – Eric Postpischil Jan 04 '18 at 14:28
  • @EricPostpischil: "*and left shifts of negative values are undefined when the value cannot be represented.*" true, but not related to `a` and `b` being `char`s. – alk Jan 04 '18 at 14:32
  • @alk: In `a-b << 25`, the value being shifted is not a `char`. – Eric Postpischil Jan 04 '18 at 14:32
  • @EricPostpischil: Also true, but this does not make arithmetic with `char`s involved (`(a-b) << 25`) undefined. I mean: The problem is not that `a` and `b` are `char`s. – alk Jan 04 '18 at 14:34
  • @JohnBollinger: Per the C standard, the characters `0` to `9` are consecutive. The standard clearly anticipates doing arithmetic on these characters at least. – Eric Postpischil Jan 04 '18 at 14:35
  • @alk: Yes, it does, because there is no arithmetic on `char`. No arithmetic expression in `C` operates on `char` values. – Eric Postpischil Jan 04 '18 at 14:36
  • @EricPostpischil: ... "because" (or better "instead") the usual arithmetic conversions apply. So from the practical point of view I feel all is fine. The OP's question is "*How to do safe arithmetic on the char type*" the answer is: By *implicitly* treating them as `int`s . – alk Jan 04 '18 at 14:41
  • I don't disagree, @EricPostpischil, but what the standard anticipates is not the point. My comments are in reference to [the code style / convention position the OP referenced](https://stackoverflow.com/a/48091506/2402272). I happen to subscribe to that position myself, though that's not really relevant either. – John Bollinger Jan 04 '18 at 14:41
  • @alk: Re: “because…” Yes, the statement is true because of reasons. Nonetheless, it is true. Arithmetic on character types does not have fully specified behavior because there is no arithmetic on character types and because character types are automatically promoted to `int`, for which arithmetic is not fully specified. Merely “treating” them as `int` is not safe because `int` arithmetic is not fully specified and this is prone to human error. – Eric Postpischil Jan 04 '18 at 14:47
  • @alk, "arithmetic with [characters] is not well defined" is a statement about characters in the abstract sense, not about their representations in computer memory. That should be clear from the rest of that comment. Even in terms of in-memory representations, however, arithmetic on characters is at best incompletely defined because it depends on the system's execution character set. – John Bollinger Jan 04 '18 at 14:48
  • @EricPostpischil "The C standard does not say that `strcmp` compares bytes as `unsigned char`." Actually, it _does_ say that in section 7.24.4 Comparison functions, "The sign of a nonzero value returned by the comparison functions `memcmp`, `strcmp`, and `strncmp` is determined by the sign of the difference between the values of the first pair of characters (both interpreted as `unsigned char`) that differ in the objects being compared." My guess is that `strcmp` would alias the pointers for consistency with `memcmp`, but the standard doesn't explicitly say that. – Ian Abbott Jan 04 '18 at 15:23
  • @EricPostpischil Following on from that, I think section 7.24 paragraph 3 implies that the pointers are aliased to `unsigned char *` since "every possible object representation is valid and has a different value", which wouldn't be the case for a 1's complement or sign and magnitude representation of (signed) `char`. – Ian Abbott Jan 04 '18 at 15:34
  • Wow... Just a bit of clarification, although it should be (have been?) clear from the question. Suppose I have the following function prototype: `PrintUTF8String(const char* str);`, for example on an ARM micro connected to some custom LCD display. I need to write an implementation for it. How should I handle the characters of `str` when it comes to decoding UTF-8? Or how else should I approach such a problem in a safe and portable manner? (although the other question I referred to suggests that I should use such a function prototype) – Jubatian Jan 04 '18 at 16:11
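
The two-step cast described in the comments above can be captured in a small helper (the `char_to_byte` name is invented for illustration):

```c
/* Convert a plain char to its byte value in the range 0..255,
   regardless of whether char is signed on the implementation at hand. */
unsigned char_to_byte(char c)
{
    /* Cast to unsigned char first, which wraps any negative value
       modulo 256; only then widen to unsigned.  A direct (unsigned)c
       would instead produce a huge value (e.g. 0xFFFFFFE9 for the byte
       0xE9) when char is signed and the high bit of c is set. */
    return (unsigned)(unsigned char)c;
}
```

With 8-bit `char`, `char_to_byte((char)0xE9)` yields `0xE9` whether `char` is signed or unsigned.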

1 Answer


A largely safe way to operate on character values might be to use `unsigned char` types and to immediately cast them to `unsigned` in expressions (e.g., write `(unsigned) a - (unsigned) b` rather than `a - b`).

If you use a character type in an arithmetic expression, even `unsigned char`, it will be promoted to `int`¹, and arithmetic with `int` values is not fully specified in C (notably, the behavior on overflow is undefined). Immediately casting each object to `unsigned` effectively sidesteps this, resulting in arithmetic on `unsigned` values, which is more fully defined.
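A short sketch of the hazard and the fix, using the values from the comment discussion above (with 32-bit `int`, the promoted `a - b` is the negative value -252, and left-shifting a negative `int` is undefined):

```c
/* (a - b) << 25 would be undefined behavior here, because a and b are
   promoted to int and the subtraction yields -252.  Casting each
   operand to unsigned first makes the subtraction wrap modulo
   UINT_MAX + 1, and the shift is then fully defined. */
unsigned shift_demo(unsigned char a, unsigned char b)
{
    return ((unsigned)a - (unsigned)b) << 25;
}
```

On a typical implementation with 32-bit `unsigned`, `shift_demo(1, 253)` evaluates to `0x08000000`.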

This is not a perfect solution. It results in cumbersome code, with numerous `(unsigned)` casts. And, of course, having defined behavior does not mean you will always get the desired behavior: humans may still write expressions that wrap (instead of overflowing) when it is not desired. There is no way to eliminate all human error.

Footnote

¹ Per discussion elsewhere, it may be possible in an esoteric C implementation for `char` and `int` to be the same size, in which case `unsigned char` would be promoted to `unsigned int`. For all practical purposes, you can disregard this.

Eric Postpischil
  • It isn't quite clear why you insist on unsigned arithmetic. What's wrong with the good old `int`? Unsigned is a PITA in general for arithmetic. Bit fiddling is better with unsigned, but arithmetic and bit fiddling are two rather different things. – n. m. could be an AI Jan 04 '18 at 15:19
  • "it may be possible in an esoteric C implementation for `char` and `int` to be the same size" - yepp. E.g. C40 signal processors and C: _Since the TMS320C3x/C4x char is 32 bits (to make it separately addressable), a byte is also 32 bits. This yields results that you may not expect; for example, sizeof (int) == 1 (not 4). TMS320C3x/C4x bytes and words are equivalent (32 bits)._ (Copied from [MS320C3x/C4x Optimizing C Compiler User’s Guide](http://www.ti.com/lit/ug/spru034h/spru034h.pdf)) – Scheff's Cat Jan 04 '18 at 15:25
  • @n.m.: What is wrong with `int` is that `int` arithmetic is not fully specified, as the answer states. – Eric Postpischil Jan 04 '18 at 15:40
  • @EricPostpischil "int arithmetic is not fully specified" so? It's specified enough for most character-processing needs. Being fully specified is a great property, but being useful is still a necessary one. – n. m. could be an AI Jan 04 '18 at 16:00