Context

I have a char variable to which I need to apply a transformation (for example, add an offset). The result of the transformation may or may not overflow.
I don't really care about the actual value of the variable after the transformation is performed.
The only guarantee I want is that I can retrieve the original value if I perform the transformation again in the opposite direction (for example, subtract the offset).

Basically:

char a = 42;
a += 140; // overflows (undefined behaviour)
a -= 140; // must be equal to 42

Problem

I know that signed types overflow is undefined behaviour, but that is not the case for unsigned types. I therefore chose to add an intermediate conversion step to the process.

It would then become:

  1. char -> unsigned char conversion
  2. Apply the transformation (resp. the reversed transformation)
  3. unsigned char -> char conversion

This way, I have the guarantee that any potential overflow will only occur in an unsigned type.
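The three steps above can be sketched as follows. This is only an illustration: `add_offset`, `subtract_offset` and `check_round_trip` are hypothetical names, and the round trip relies on the unsigned char -> char conversion wrapping, which mainstream implementations do and which C++20 requires.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: apply the offset while doing the arithmetic in unsigned char.
inline char add_offset(char value, unsigned char offset) {
    unsigned char u = static_cast<unsigned char>(value);  // step 1: well-defined, modulo 2^CHAR_BIT
    u = static_cast<unsigned char>(u + offset);            // step 2: sum is an int, narrowed modulo 2^CHAR_BIT
    return static_cast<char>(u);                            // step 3: implementation-defined before C++20 if out of range
}

// Hypothetical helper: the reversed transformation.
inline char subtract_offset(char value, unsigned char offset) {
    unsigned char u = static_cast<unsigned char>(value);
    u = static_cast<unsigned char>(u - offset);             // a negative int converts to unsigned char modulo 2^CHAR_BIT
    return static_cast<char>(u);
}

// Checks the round trip for every possible char value (assumes the wrapping conversion noted above).
inline void check_round_trip(unsigned char offset) {
    for (int i = CHAR_MIN; i <= CHAR_MAX; ++i) {
        const char original = static_cast<char>(i);
        assert(subtract_offset(add_offset(original, offset), offset) == original);
    }
}
```

Calling `check_round_trip(140)` exercises the example from the question for all char values rather than only 42.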

Question

My question is: what is the proper way to perform such a conversion?

Three possibilities come to mind. I can use either:

  • implicit conversion
  • static_cast
  • reinterpret_cast

Which one is valid (not undefined behaviour)? Which one should I use (correct behaviour)?

My guess is that I need to use reinterpret_cast, since I don't care about the actual value. The only guarantee I want is that the value in memory remains the same (i.e. the bits don't change), so that the transformation is reversible.

On the other hand, I'm not sure if the implicit conversion or the static_cast won't trigger undefined behaviour in the case where the value is not representable in the destination type (out of range).

I couldn't find anything explicitly stating it is or is not undefined behaviour, I just found this Microsoft documentation where they did it with implicit conversions without any mention of undefined behaviour.


Here is an example, to illustrate:

char a = -4;                                             // out of unsigned char range
unsigned char b1 = a;                                    // (A)
unsigned char b2 = static_cast<unsigned char>(a);        // (B)
unsigned char b3 = reinterpret_cast<unsigned char&>(a);  // (C)

std::cout << std::boolalpha << (b1 == b2 && b2 == b3) << '\n';

unsigned char c = 252;                                   // out of (signed) char range
char d1 = c;                                             // (A')
char d2 = static_cast<char>(c);                          // (B')
char d3 = reinterpret_cast<char&>(c);                    // (C')

std::cout << (d1 == d2 && d2 == d3) << '\n';

The output is:

true
true

Unless undefined behaviour is triggered, the three methods seem to work.

Are (A) and (B) (resp. (A') and (B')) undefined behaviour if the value is not representable in the destination type?

Is (C) (resp. (C')) well defined ?

Fareanor
  • https://en.cppreference.com/w/cpp/numeric/bit_cast ? – Raildex Mar 21 '22 at 13:20
  • Just cast unsigned to signed (which only results in negative number if the highest-bit (negative-mark) is set). – Top-Master Mar 21 '22 at 13:22
  • Does this answer your question? [C++ Implicit Conversion (Signed + Unsigned)](https://stackoverflow.com/questions/17832815/c-implicit-conversion-signed-unsigned) – Top-Master Mar 21 '22 at 13:24
  • @Top-Master Not really, I'm not interested in comparison/operations, only in conversion (for example if I want to convert `std::basic_string<unsigned char>` to `std::string` and vice versa). – Fareanor Mar 21 '22 at 13:29
  • @Raildex It may be a solution, I didn't know C++20 added such a feature. – Fareanor Mar 21 '22 at 14:33
  • `a += 140; // overflows (undefined behaviour)` is **not** signed integer overflow, not UB. It is implementation defined behavior, when `char` is _signed_ and 8-bit - to assign a value outside the `char` range. – chux - Reinstate Monica Mar 21 '22 at 15:14
  • @chux-ReinstateMonica Does it mean that my first sample was legal ? If it is implementation defined, is it guaranteed that I will always get `42` (the original value) at the end ? – Fareanor Mar 21 '22 at 15:18
  • @Fareanor [Comment](https://stackoverflow.com/questions/71558263/proper-way-to-perform-unsigned-signed-conversion/71559966#comment126475068_71558263) expanded in answer below. – chux - Reinstate Monica Mar 21 '22 at 15:22

2 Answers

I know that signed types overflow is undefined behaviour,

True, but does not apply here.

a += 140; is not signed integer overflow and not UB. It behaves like a = a + 140;, and a + 140 is computed as an int after integer promotion, so it does not overflow when a is an 8-bit signed char or unsigned char.

The issue is what happens when the sum a + 140 is out of char range and assigned to a char.
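A minimal sketch of why the addition itself cannot overflow, assuming the usual case where int can represent every char value. `sum_before_narrowing` and `check_sum` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Both operands of + undergo integral promotion, so the sum has type int.
static_assert(std::is_same<decltype(char{} + 140), int>::value,
              "char + int literal is computed as int");
static_assert(std::is_same<decltype(char{} + char{}), int>::value,
              "char + char is also computed as int");

// Hypothetical helper: the sum as it exists before being converted back to char.
inline int sum_before_narrowing(char a, int offset) {
    return a + offset;  // plain int arithmetic, no char overflow involved
}

// The int result 182 is exact; only storing it back into a signed 8-bit char is out of range.
inline void check_sum() {
    assert(sum_before_narrowing(42, 140) == 182);
}
```

The out-of-range step is therefore the conversion of 182 back to char, not the addition.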

Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised. C17dr § 6.3.1.3 3

When char is signed and 8 bits wide, assigning a value outside the char range is implementation-defined behavior.

Usually the implementation-defined behavior is to wrap, which is fully defined, so a += 140; is fine as is.

Alternatively, the implementation-defined behavior might be to cap (saturate) the value to the char range when char is signed.

char a = 42;
a += 140;
// Might act as if the result were clamped to the char range:
a = max(min(a + 140, CHAR_MAX), CHAR_MIN);
// which leaves a == 127 when CHAR_MAX is 127

To avoid implementation-defined behavior, perform the + or - on a accessed as an unsigned char:

*((unsigned char *)&a) += small_offset;

Or just use unsigned char a and avoid all this. unsigned char is defined to wrap.
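In C++ spelling, the same access can be sketched with reinterpret_cast, as also confirmed in the comments below. The helper names here are hypothetical; the arithmetic happens on the byte viewed as unsigned char, so it wraps modulo 2^CHAR_BIT and the bits stored in a are never converted by value.

```cpp
#include <cassert>

// Hypothetical helper: add an offset in place through an unsigned char lvalue.
inline void add_offset_in_place(char& a, unsigned char offset) {
    // Accessing an object through unsigned char is permitted by the aliasing rules.
    reinterpret_cast<unsigned char&>(a) += offset;
}

// Hypothetical helper: the reversed transformation.
inline void subtract_offset_in_place(char& a, unsigned char offset) {
    reinterpret_cast<unsigned char&>(a) -= offset;
}

// Round trip for the example from the question.
inline void check_in_place_round_trip() {
    char a = 42;
    add_offset_in_place(a, 140);
    subtract_offset_in_place(a, 140);
    assert(a == 42);
}
```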

chux - Reinstate Monica
  • "Usually the implementation defined behavior is a wrap".... but the standard also allows behaviors such as saturating, which I believe actually is the chosen behavior for some GPU cross-compilers. – Ben Voigt Mar 21 '22 at 15:30
  • @BenVoigt True, that case recently added. – chux - Reinstate Monica Mar 21 '22 at 15:31
  • @chux-ReinstateMonica Is the `reinterpret_cast` way equivalent to your sample for avoiding the implementation-defined behaviour? I mean `reinterpret_cast<unsigned char&>(a) += small_offset;` – Fareanor Mar 21 '22 at 15:40
  • @Fareanor Yes. Treat `a` as `unsigned char`. Sorry for the C-ish answer to a C++ question. IAC, just use `unsigned char a` and all is well and _simply_. – chux - Reinstate Monica Mar 21 '22 at 15:42
  • I agree `a += 140` doesn't overflow, but could you clarify your reasoning? I assume it's because the literal 140 is an int and thus `a` is promoted to int and thus the addition is performed on two small ints. If that's it, then if you had `a += 'x';` you might face an actual overflow, right? – Adrian McCarthy Mar 21 '22 at 15:45
  • @AdrianMcCarthy "the addition is performed on two small ints" --> Yes. With `char + char` in C++, I am going to have to research. In C, both are promoted to `int` (well, the `'x'` is an `int` there) and then added. In C++, I suspect that does not apply and is as you suggest. – chux - Reinstate Monica Mar 21 '22 at 15:49
  • @chux-ReinstateMonica: In C++ both are also promoted to `int`. I think Adrian is suggesting that the numeric value of `x` might be quite large... – Ben Voigt Mar 21 '22 at 15:53
  • To avoid UB of signed overflow, a larger offset could be made _unsigned_ as in `a += 1234567890u;` to avoid the UB issue. IAC, the cleanest code is to forego `char a;` and use `unsigned char a;`. – chux - Reinstate Monica Mar 21 '22 at 16:04
  • @AdrianMcCarthy No, even 2 chars will undergo [Integer promotions](https://en.cppreference.com/w/c/language/conversion#Integer_promotions) (as a part of [Usual arithmetic conversions](https://en.cppreference.com/w/c/language/conversion#Usual_arithmetic_conversions)) and will be converted to int. – danadam Mar 21 '22 at 16:07
  • @danadam: Thanks. I thought this was an area where C++ diverged from C (e.g., integral promotions aren't applied to function arguments if there's an appropriate overload). Though you linked to a summary of the C standard promotion rules, I found the C++ standard describes the same behavior in the case of arithmetic. – Adrian McCarthy Mar 21 '22 at 18:21
  • Can someone link me a list of all undefined behaviour? I mean, the whole C++ language should be deprecated (because of UB). – Top-Master Mar 22 '22 at 06:24
  • @Top-Master UB comes in 2 flavors: behavior that is explicitly defined as UB and behavior that is not defined by the spec. A "list of all undefined behaviour" of the first type is discernible from the spec itself - I am sure sites exist that enumerate them in a condensed way. Of the 2nd type, example UB could be listed, but not fully enumerated as it is _not defined_. [What are all the common undefined behaviours that a C++ programmer should know about?](https://stackoverflow.com/q/367633/2410359) is a starting point. – chux - Reinstate Monica Mar 22 '22 at 09:16

For full portability, you do have a small problem insofar as (except for char[1]) signed data types have not been[2] required to have as many distinct values as their unsigned counterparts. Very few systems actually used sign-magnitude representation for integral types, but if you cannot rule them out, then simply doing the math in the unsigned counterpart does not actually guarantee round-tripping, even if you use numeric_limits<?>::min() to try to avoid conversion of unrepresentable values.

With that caveat out of the way, the direct answer to your question is that both implicit conversion and static_cast are correct (and equivalent) for converting a value between its signed and unsigned counterpart types. In the signed->unsigned direction, the behavior is well-defined by the Standard, while in the other direction the behavior is implementation-defined (until C++20, which requires two's complement conversion behavior).
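As a rough illustration of the two directions, here is a sketch with hypothetical helpers `to_unsigned` and `to_char`. The signed->unsigned result is the source value reduced modulo 2^CHAR_BIT; the return trip is assumed to wrap, which is what mainstream implementations do and what C++20 mandates.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: char -> unsigned char, well-defined (value modulo 2^CHAR_BIT).
inline unsigned char to_unsigned(char c) {
    return static_cast<unsigned char>(c);  // same result as the implicit conversion
}

// Hypothetical helper: unsigned char -> char, implementation-defined before C++20
// when the value does not fit in char.
inline char to_char(unsigned char u) {
    return static_cast<char>(u);
}

// Compares the implicit conversion with static_cast and checks the modular result.
inline void check_conversions() {
    const char a = static_cast<char>(-4);
    const unsigned char implicit_result = a;  // implicit conversion, as in (A)
    assert(implicit_result == to_unsigned(a));
    assert(to_unsigned(a) == static_cast<unsigned char>(UCHAR_MAX + 1 - 4));
}
```

With an 8-bit char, `to_unsigned(a)` for a value of -4 is 252, matching the question's example.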


[1] char and signed char themselves are rescued from this possibility by their endorsement for access to the byte representation of any object, including to unsigned objects which are required not to have any missing values.

[2] Two's complement conversion behavior is required in the latest version of C++, see https://eel.is/c++draft/basic.fundamental#3

Ben Voigt
  • I think the C++11 standard "fixed" the problem of signed char possibly not being able to represent every value of an unsigned char. cppreference.com agrees: "For every value of type unsigned char in range [0, 255], converting the value to char and then back to unsigned char produces the original value. (since C++11)" The fix applies only to char types, not other integral types. – Adrian McCarthy Mar 21 '22 at 15:32
  • So for `unsigned` ->`signed`, it is implementation defined, which means, I don't have the guarantee to get back my original value (up to the implementation to give me that guarantee but it won't be portable). Or maybe (from the footnote) `char` is an exception (so that for `unsigned char` -> `char`, I would still have this guarantee ?) – Fareanor Mar 21 '22 at 15:34
  • @AdrianMcCarthy: character types have been special in that way long before C++11, but if that statement is an accurate paraphrase, then C++11 made it easier to make use of the guarantee. My footnote is valid for all standard C++ versions. – Ben Voigt Mar 21 '22 at 15:34
  • Aside: "including to unsigned objects which are required not to have any missing values." --> IIRC that only applies to `unsigned char`, _unsigned_ `char` and `uintN_t` types. Others may rarely have padding. – chux - Reinstate Monica Mar 21 '22 at 15:35
  • @Fareanor: In all versions, conversion with `memcpy` (or the legal type-pun shown at the bottom of chux's answer) will work for `char` and `signed char`. Trying to verify the claim quoted by Adrian. – Ben Voigt Mar 21 '22 at 15:36
  • @chux-ReinstateMonica: All unsigned types are required to have modulo arithmetic behavior, modulo a base which is a power of two determined by the number of bits in the representation. That permits non-representation bits (padding or parity or ECC) but forbids missing values. – Ben Voigt Mar 21 '22 at 15:37
  • @BenVoigt OK. Sounds like we are reading "missing values" values differently. – chux - Reinstate Monica Mar 21 '22 at 15:40
  • @chux-ReinstateMonica: Think of "signalling NaN" in IEEE floating-point. Until recently, C++ allowed signed integer types (except character types) to have similar trap values. And of course sign-magnitude is famous for having a missing value because zero consumes two representation patterns. – Ben Voigt Mar 21 '22 at 15:42
  • @BenVoigt When the answer used "are required not to have any missing values.", I read that as the number of distinct signed values equals the number of distinct unsigned values. With wider than `char` and 2's complement encoding that is overwhelmingly true. C specifies the unsigned type range only as at least as big as the positive signed range. E.g. 128 native 2's comp. signed type machine with a 127 bit unsigned (1 padding). I am applying C thoughts to a C++ question - risky, yet I thought C++ inherited that and so serves as a rare counter example. – chux - Reinstate Monica Mar 21 '22 at 16:00
  • @chux: Formerly that was true in C++: You could have `numeric_limits<T>::min() == -numeric_limits<T>::max()` (-32767 and +32767 respectively would be a 16-bit sign-magnitude platform). Latest C++ requires two's complement behavior for both directions of the signed/unsigned conversion -- if the underlying hardware has any other representation, then the compiler must hide that from the programmer, and sign-magnitude representation is no longer permissible, because it is missing support for a value required by the conversion. – Ben Voigt Mar 21 '22 at 19:09
  • @BenVoigt My comments have nothing to do with 1s' complement nor sign-magnitude, but with 2's complement and potential padding bits in the _unsigned_ type. – chux - Reinstate Monica Mar 21 '22 at 19:31
  • @chux-ReinstateMonica: Ok, that can happen but entirely irrelevant to the discussion here. Signed type is required to have the same width as the unsigned type. All bits counted in the width are required to be available in the unsigned type, padding bits not counted in the width aren't important. So you can't have a 128 bit signed type and a 127 bit unsigned type. – Ben Voigt Mar 21 '22 at 19:35
  • My assertion is not about a 127-bit unsigned type, but with a 128 unsigned type with 1 padding bit paired with a 128-bit 2's complement type with no padding. Both have same width, both same max value. C++, at one time allowed such, perhaps no longer. – chux - Reinstate Monica Mar 21 '22 at 19:47
  • @chux-ReinstateMonica I see, neither C nor old versions of C++ require the width of signed and corresponding unsigned types to be equal. However, you are describing them wrong. "width" explicitly includes sign and value bits and excludes padding bits. Your hypothetical situation, which IS legal, has a 127 bit wide unsigned type occupying 128 bits of storage. – Ben Voigt Mar 21 '22 at 22:06