
I have a use case where I need to convert signed values to unsigned to make the values sortable. I need this for char, short, int, long, and long long.

By sortable, I mean that for signed type X, if (a < b), then after converting to unsigned, converted(a) < converted(b). Note that in many cases, converting a negative signed value directly to an unsigned value will make it greater than 0 and break this constraint (on two's complement implementations).

The simplest idea for a char is:

unsigned char convert(char x)
{
    return (unsigned char)(x ^ 0x80);  // flip sign to make it sortable
}

But this seems to be undefined behavior.

While it might be possible to convert to a larger type, add the type's MIN value, and convert to the unsigned type, I'm not sure this is any more compliant, and it won't work for long long.

How can this be done without any undefined behavior for all types?

It seems safe to convert using memcpy, but it's not clear how to maintain sort order in a compliant way.
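As a concrete illustration of that last point (a sketch; the function name `bit_copy` is mine, not from any library), the memcpy route is well-defined but demonstrates exactly why order is not preserved:

```cpp
#include <cstring>

// Sketch: memcpy copies the object representation, which is well-defined,
// but on a two's-complement machine a negative input comes out as a large
// unsigned value, so sort order is NOT preserved.
unsigned int bit_copy(int x)
{
    unsigned int u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
```

For example, on a two's complement machine, bit_copy(-1) compares greater than bit_copy(0).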

(Note that this is similar to: No compliant way to convert signed/unsigned of same size, except I need the results to maintain sort order.)

Glenn Teitelbaum
  • What is the use-case here? If you convert to a larger type, can you just make a special class to handle the output from `long long`, and/or throw an exception (or do other error-handling) when you encounter a number that can't be converted? – Kyle Strand Nov 06 '15 at 18:40
  • More info needs to be added. What is the reason for having to convert it from a signed to an unsigned for sorting purposes? What is preventing you from sorting it as a signed value? – Adrian Nov 09 '15 at 16:59
  • @Adrian I need to shift the values - and it is undefined if the values are negative – Glenn Teitelbaum Nov 09 '15 at 17:26
  • So you want to left shift this signed value and you are concerned about portability? Then you could use multiply instead of a shift. If the compiler you are using is worth anything, it should optimize this into a shift under the hood. Sometimes, forcing an optimization can cause more problems including stopping the optimizer from doing its job. Of course, multiplying a signed value will cause other things that may not be wanted, such as distance changing, which may not be wanted. YMMV. – Adrian Nov 09 '15 at 18:16

5 Answers


You are doing it wrong, because flipping the sign-bit of a signed value isn't actually defined.

Let's use two-bit types:

    00    01   10   11     Order for unsigned           0     1    2    3
    10    11   00   01     Order for 2s complement     -2    -1    0    1
    11   (10   00)  01     Order for sign-magnitude    -1   (-0   +0)   1
    10   (11   00)  01     Order for 1s complement     -1   (-0   +0)   1

What you want to do is convert to unsigned (which is always defined, as value-preserving modulo 2^N), and then subtract a bias in unsigned arithmetic so the most negative number becomes 0:

int x = whatever;
unsigned r = (unsigned)x - (unsigned)INT_MIN;

Take care: Signed overflow is not defined, so we avoided signed types.

Of course, it doesn't help if the unsigned type has fewer values than the signed one, which is allowed in general, though not for char.
And you need to take special care if you want to preserve a negative 0 as negative.
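A generic sketch of this bias trick for any standard signed type might look like the following (the template and function names are my own, not from the answer; it assumes the unsigned type has more values than the signed one, which holds on all common platforms):

```cpp
#include <limits>
#include <type_traits>

// Sketch: convert to the corresponding unsigned type (value-preserving
// modulo 2^N), then subtract the unsigned image of the minimum so the
// most negative input maps to 0 and ordering is preserved.
template <typename T>
typename std::make_unsigned<T>::type to_sortable(T x)
{
    using U = typename std::make_unsigned<T>::type;
    return static_cast<U>(x) - static_cast<U>(std::numeric_limits<T>::min());
}
```

On a typical two's complement machine, to_sortable(INT_MIN) is 0 and to_sortable(INT_MAX) is UINT_MAX.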

Deduplicator

This is not possible if you want to remain fully portable.

The range of unsigned int is only specified to at least cover the non-negative values of int. The standard allows for implementations where UINT_MAX == INT_MAX. The same applies to all other non-fixed-width integer types.

Given that the range of unsigned int may be smaller than that of int, the pigeonhole principle applies: you have no way of redistributing all values of int to corresponding but different values of unsigned int unless unsigned int can store at least as many different values as int.


To quote N4140 (roughly C++14):

3.9.1 Fundamental types [basic.fundamental]

1 [...] For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. [...]

3 For each of the standard signed integer types, there exists a corresponding (but different) standard unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", "unsigned long int", and "unsigned long long int", each of which occupies the same amount of storage and has the same alignment requirements (3.11) as the corresponding signed integer type47; that is, each signed integer type has the same object representation as its corresponding unsigned integer type. [...] The range of non-negative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the value representation of each corresponding signed/unsigned type shall be the same. [...]

This guarantees that you don't have a problem for unsigned char. There is no possibility for unsigned char to have any kind of padding bits. It wouldn't make sense for unsigned char to have padding bits: given unsigned char c;, how would you access those padding bits? reinterpret_cast<unsigned char &>(c)? That obviously just gives you c. The only thing similar to padding bits that is possible for unsigned char is something that's completely transparent to the program, for instance when ECC memory is used.

For all the other non-fixed-width integer types, from short to long long, the standard's meaning of "subrange" allows for an equal range.

I think I vaguely recall reading that there may have been ancient CPUs that did not provide any native unsigned operations. This would make it very tricky for implementations to properly implement unsigned division, unless they declared that the would-be-sign-bit of unsigned types would be treated as a padding bit. This way, they could simply use the CPU's signed division instruction for either signed or unsigned types.

  • Ahhh, you mean like 1s complement having 2 zeros (0, -0)? But there you still have more unsigned than signed. I thought `unsigned` always has `256^sizeof(x)` values, and it's only `signed` that can have fewer – Glenn Teitelbaum Nov 06 '15 at 15:14
  • @GlennTeitelbaum I mean where the sign bit of signed types corresponds to a padding bit of unsigned types. –  Nov 06 '15 at 15:18
  • Congratulations on finding yet another weird corner-case. – Deduplicator Nov 06 '15 at 15:21
  • In those cases, is any access to the pad bit undefined? Or is setting it undefined or not allowed since it will make the value greater than UINT_MAX? By the way, is it defined for `char`, `short`, `long`, and `long long` and only int is the issue? – Glenn Teitelbaum Nov 06 '15 at 16:04
  • @GlennTeitelbaum In C, a padding bit never contributes to the value, but may make the representation a trap representation. If it does so, then reading an object in which the padding bit is set to the wrong value results in undefined behaviour. In C++, the effect is the same, but the wording is different. I'll try to update my answer based on the standard when I get the chance to check the details. This covers `short`, `long` and `long long`. `unsigned char` doesn't have this issue, since `unsigned char` must be able to represent any byte value. I'll look for the exact wording for that too. –  Nov 06 '15 at 16:08
  • @MSalters C does refer to any non-sign non-value bit as a padding bit, regardless of whether it enables trap representations, but C++ might not. If C++ has a different name for it, then when I update this answer, I'll use C++'s name. –  Nov 06 '15 at 16:44
  • @hvd: I might have been misled by [N2631](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2631.html) which stated that " padding **bytes** .. never cause a trapping representation". I should have double-check my assumptions. – MSalters Nov 06 '15 at 17:35

To keep the ordering that you want, you must add the same amount to all values such that

a) their relative differences are unchanged and

b) all negative values are turned into nonnegative values.

Adding a consistent amount is the only way to do this. If all the values you are sorting are originally of the same signed type T, then the amount to add to ensure every negative value becomes nonnegative is -numeric_limits<T>::min(); in other words, you must subtract the minimum signed value, which is negative.

If you are bringing in different types into the same sort (e.g. sorting char values along with short, int, long, etc.), you might want to make the first step a conversion to the largest signed type you will handle. There is no loss of information to go from a smaller signed type to a larger signed type.

To avoid overflow problems, I would suggest doing the shift (i.e., subtracting the minimum) conditionally:

if (value < 0): convert by first subtracting the minimum (making the value nonnegative) and then converting to the unsigned type (which is now completely safe);

else: first convert the already nonnegative value to the unsigned type (completely safe) and then add the same adjustment as a positive value, i.e., add numeric_limits<T>::max() + 1.

T in both cases is the original signed type. The expression numeric_limits<T>::max() + 1 can be computed and converted to the destination type once, and then used as a constant of type newT.
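A sketch of this conditional approach for int specifically (the function name is illustrative, and it assumes unsigned int can hold more values than int, as on all common platforms):

```cpp
#include <limits>

// Sketch: avoid signed overflow by branching. For negative input,
// value - min() stays within [0, INT_MAX], so the signed subtraction is
// safe; for nonnegative input, convert first, then add max()+1 in
// unsigned arithmetic.
unsigned int convert_ordered(int value)
{
    const unsigned int bias =
        static_cast<unsigned int>(std::numeric_limits<int>::max()) + 1u;
    if (value < 0)
        return static_cast<unsigned int>(value - std::numeric_limits<int>::min());
    else
        return static_cast<unsigned int>(value) + bias;
}
```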

Thomas

I would subtract numeric_limits<T>::min() from each value. This preserves the ordering property you want, and if the underlying representation is 2's complement (i.e., the only sane representation, and the one that in practice every non-museum-resident computer uses) it will do what you expect, including for the boundary cases where the input value equals the most negative or most positive representable integer, provided the compiler uses a SUB instruction rather than an ADD instruction (since the positive value -numeric_limits<T>::min() is too large to represent).

Is this standard compliant? No idea. My guess is: Probably not. Feel free to edit if you know.

j_random_hacker
  • Cast the minimum value to unsigned before the subtraction; that will not only avoid Undefined Behavior, but it will also yield results independent of the machine's numerical representations since for any signed integers x and y where x>y, casting both to an unsigned type with at least the same number of value bits and subtracting will yield the numerical difference between them. – supercat Nov 06 '15 at 22:00

The formula x-(unsigned)INT_MIN will yield a suitable ranking on all machines where UINT_MAX > INT_MAX. For any pair of signed integers x and y where x>=y, (unsigned)x-(unsigned)y will equal the numerical value of x-y; so if y is INT_MIN, then x>=y for all x, and the aforementioned formula reports the amount by which x exceeds INT_MIN, which of course ranks the same as x.
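To illustrate (a sketch; the helper names here are mine), the formula can be used directly as a sort key:

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Sketch: rank values with the formula above; assumes UINT_MAX > INT_MAX,
// as the answer requires.
unsigned sort_key(int x) { return (unsigned)x - (unsigned)INT_MIN; }

// Sorting by sort_key yields ordinary ascending signed order.
void sort_by_key(std::vector<int>& v)
{
    std::sort(v.begin(), v.end(),
              [](int a, int b) { return sort_key(a) < sort_key(b); });
}
```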

supercat