30

As per the C standard, the value representation of an integer type is implementation-defined. So 5 might not be represented as 00000000000000000000000000000101, nor -1 as 11111111111111111111111111111111, as we usually assume in 32-bit 2's complement. So even though the operators ~, << and >> are well defined, the bit patterns they work on are implementation-defined. The only defined bit pattern I could find was "§5.2.1/3 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.".

So my question is: is there an implementation-independent way of converting integer types to a bit pattern?

We can always start with a null character and do enough bit operations on it to get it to a desired value, but I find that too cumbersome. I also realise that practically all implementations will use a 2's complement representation, but I want to know how to do it in a pure C standard way. Personally I find this topic quite intriguing because of device-driver programming, where all code written to date assumes a particular implementation.

tinkerbeast
  • 1,707
  • 1
  • 20
  • 42
  • 3
    "...the values they will work on is implementation defined..." what does that mean? – Dan Byström Mar 09 '15 at 08:13
  • 1
What exactly are you trying to do? The answer to this question depends on the operations you are doing. That being said, I don't think it's usual to encounter this issue at the device-driver level. If you switch to different hardware with a different integer representation, most likely you have to rewrite your driver anyway. – user694733 Mar 09 '15 at 08:23
  • 1
    Could you provide an example of what you like to accomplish? – Arjun Sreedharan Mar 09 '15 at 08:24
  • 1
    @JoachimPileborg : so what about Gray code for example? It is not "binary", but does the C standard forbid integers represented in Gray code? – Leonard Michlmayr Mar 09 '15 at 08:26
@LeonardMichlmayr Gray code is definitely binary (it only has two digits after all), but it's true that it's a possible representation where `5` is not equal to `101`. So I retract my earlier comment. – Some programmer dude Mar 09 '15 at 08:30
  • Just to print the binary *pattern* of any value is well-defined and easy to do. What the bits represent doesn't matter in that case. – Some programmer dude Mar 09 '15 at 08:32
  • 1
To access a single bit, you can use `~(~0<<1) << n` instead of `1 << n`. If even `0` is not safe, you have to use `(0^0)` instead. – Leonard Michlmayr Mar 09 '15 at 08:36
  • @Dan Byström I have edited the question to "...the bit-pattern they will work on is implementation defined...". What I mean is, how can you do bit operations when you do not know the underlying bit pattern? – tinkerbeast Mar 09 '15 at 12:06
@user694733 and arjun-sreedharan - What I'm trying to accomplish is to establish a way to work uniformly across implementations with bit patterns. I can think of very few bit operations where the person does not assume the underlying bit representation. Also, I disagree with your viewpoint on hardware - the integer representation depends on the processor, not on a peripheral controller, which might be the same. – tinkerbeast Mar 09 '15 at 12:15
  • @tinkerbeast: You do know the bit patterns, at least for unsigned types. And I think this isn't a real-world issue and all non-ancient machines use 2's complement. Though I'm not sure if it's safe to assume signed right-shift carries the sign bit (but probably only on machines lacking such an instruction, if at all) and that unsigned-to-signed conversion preserves value modulo unsigned_maximum+1 (which I rarely rely on, except for the case where characters written by `fgetc` need to be converted back to `char`, which could be signed; but I'd expect `char` to be unsigned on such a platform). – mafso Mar 09 '15 at 14:48
  • @mafso: All non-ancient machines use sign-magnitude, it's part of the IEEE-754 standard. (Admittedly, the use of sign-magnitude doesn't carry over to integer types) – Ben Voigt Mar 09 '15 at 16:40
  • @BenVoigt: We're talking about integer types, not floating-point types. – mafso Mar 09 '15 at 16:44
  • 1
    @LeonardMichlmayr: yes, the C standard forbids Gray code. For positive integer representation it requires that each bit has a value, and that the values of each bit are powers of 2. The term used in normative text is "pure binary", which is defined in a footnote, "A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral powers of 2, except perhaps the bit with the highest position." 6.2.6.2/1 relaxes "successive", to allow for different byte orders. – Steve Jessop Mar 09 '15 at 18:34
  • Then it's probably the definitions of the unsigned shift operators, in terms of multiplication and division, that gives the simplest notion how the bitwise ops relate to the mathematical ones. – Steve Jessop Mar 09 '15 at 18:39
  • 1
    @SteveJessop: Is there anything in the standard which would specify that a type larger than `char` must have its bits stored sequentially within a sequence of `char` values, or would it be legitimate to have a machine which permuted the bits in some weird fashion? Is there any requirement that it must be possible to change one bit within an `unsigned long` by changing exactly one bit within exactly one of the underlying `unsigned char` values contained within it? – supercat Mar 09 '15 at 19:45
  • @supercat: To the first question, it doesn't say anything AFAIK about the order of bits in the storage representation (and indeed big-endian and little-endian implementations differ in this respect), but the questioner's asking about bitwise ops so the underlying bytes aren't relevant. To the latter I'm pretty sure that padding bits are allowed to form a checksum, and if that's correct then clearly there's no such requirement: doing as you describe could create a trap representation :-) – Steve Jessop Mar 09 '15 at 19:55
  • @SteveJessop: Is there required to be for each bit in an `unsigned int` an identifiable bit in the `unsigned char` sequence which would identify the state of that bit, and is there any practical usefulness to such requirement when not accompanied by stronger guarantees about storage format? For example, could an implementation with a 35-bit `unsigned long` type stored as five 8-bit `char` values specify that if the MSB of any byte wasn't set, all higher bits would be "presumed" zero, thus allowing `foo = 63;` to store one byte rather than five? – supercat Mar 09 '15 at 20:04
  • @supercat: I don't know without some standards-diving. Sorry, I'm not going to do it since it doesn't affect the answer to the question. I suspect it's not forbidden and not explicitly permitted either, therefore exhaustively searching everything it says about object representation of integers in order to prove a negative might be ... exhausting. In my first comment, I mean that Gray code is forbidden in the value representation. – Steve Jessop Mar 09 '15 at 20:07
  • @SteveJessop: I think what you really mean is that bitwise operators are designed to *behave as though* integers are represented as power-of-two sequence of bits. If no specified relation is required between `int` values and their decomposed form, I would think a system could use graycode provided that the compiler produces for the bitwise operators code that will compute the same values as those operators would on a "conventional binary" machine. Gray-code would be incompatible with many *expectations* regarding type punning, but I don't know what expectations are guaranteed by standard. – supercat Mar 09 '15 at 20:23
  • @supercat: I'm saying what the standard says, that integers are represented at more than one level. At the level the standard says the most about, which is also the level the questioner is asking about since it's where bitwise operations happen, Gray code is forbidden since it's not a pure binary representation. – Steve Jessop Mar 09 '15 at 20:31
  • @SteveJessop: The standard makes clear that 5 ^ 9 yields 12; so long as an implementation honors that, I don't see how that would forbid it from representing 5 as 0110, 9 as 1110, and 12 as 1001, unless the behavior of decomposition via type punning is specified. To be sure, 0110 xor 1110 wouldn't be 1110, but the "^" operator isn't defined in terms of underlying bits, but rather on values using modular arithmetic. – supercat Mar 09 '15 at 20:53
  • @supercat: the question is about bitwise operators. It doesn't matter whether the RAM chip has little 0s and 1s written on it, or something different, but that is *not* the "representation" the questioner is asking about. Yes, the implementation can use groups of bound electrons (or Gray code) instead of the string "101", if it likes, because `5` will still be represented as `101` in the representation on which the bitwise operators are defined to act. However, it cannot use Gray code in the representation the questioner is asking about. I don't think I can make it any clearer. – Steve Jessop Mar 09 '15 at 20:56
  • .. whether the `^` operator is defined on "underlying" bits, or just on the bits of the value representation, is for you to say since you introduced the terminology "underlying" ;-) – Steve Jessop Mar 09 '15 at 21:02
  • 1
Is the question about "an implementation independent way of converting integer types to an *implementation independent* bit pattern?" or "an implementation independent way of converting integer types to the platform dependent bit pattern?" The former is easier. You can always extract a consistent representation of any number using /2 and %2 (with some care regarding negatives). For example you could use those to format any number to its two's complement representation regardless of the platform representation. – Persixty Mar 10 '15 at 00:27
  • @DanAllen: I was interpreting the question as the latter, which cannot be done in a portable way, but could pose some interesting issue with "char-sequence" punning. For the former, one can in portable C reformat any representable number into almost any arithmetically-defined sequence of characters, subject only to memory limits. – supercat Mar 10 '15 at 15:58
  • @supercat. I'm sure it can be done but given all modern architectures are two's complement the real challenge will be finding a computer on which to test the code handling the other 2 representations! – Persixty Mar 11 '15 at 07:41
  • @DanAllen: I would guess padding is mainly a concession to architectures which are hostile toward multi-precision arithmetic (e.g. if arithmetic operations on a DSP all used saturating signed arithmetic). If math were limited to values -32768 to +32767, it may be easier to have `unsigned int` be a pair of values which were kept normalized to the range 0 to 16383, and have `unsigned long` be three or four such values, than to include the conditional logic necessary to do multi-precision arithmetic without overflowing intermediate calculations. – supercat Mar 11 '15 at 16:49

3 Answers

21

In general, it's not that hard to accommodate unusual platforms in most cases (if you don't want to simply assume 8-bit char, 2's complement, no padding, no traps, and truncating unsigned-to-signed conversion); the standard mostly gives enough guarantees (a few macros to inspect certain implementation details would be helpful, though).

As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever this should mean), but what is observable by portable code. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for bitwise operators and shifts.

For negative values of signed types, different encodings are allowed, which leads to different (but well-defined for every case) results when re-interpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: They are equal for two's complement without padding bits, but different for the other encodings if n is negative.

Further, padding bits may exist, and signed integer types may have more padding bits than their corresponding unsigned counterparts (but not the other way round, type-punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so e.g. to get an unsigned value where only the sign-bit (of the corresponding signed type) is set, something like this must be used:

#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )
unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
                    TYPE_PUN(unsigned, int, -1) & ~1u;

(there are probably nicer ways) instead of

unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;

as this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it's 0 to determine it; GCC can constant-fold that.) Padding bits can be inspected by memcpying into an unsigned char array, though they may appear to "wobble": reading the same padding bit twice may give different results.

If you want the bit pattern (without padding bits) of a signed integer (little endian):

int print_bits_u(unsigned n) {
    do {                          // do-while so that 0 prints as "0"
        putchar(n&1 ? '1' : '0'); // n&1 never traps
        n >>= 1;
    } while(n);
    return 0;
}

int print_bits(int n) {
    return print_bits_u(*(unsigned *)&n & INT_MAX);
    /* This masks padding bits if int has more of them than unsigned int.
     * Note that INT_MAX is promoted to unsigned int here. */
}

int print_bits_2scomp(int n) {
    return print_bits_u(n);
}

print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern); print_bits_2scomp gives the two's complement representation (possibly with a greater width than a signed int has, if unsigned int has fewer padding bits).

Care must be taken not to generate trap representations when using bitwise operators and when type-punning from unsigned to signed, see below how these can potentially be generated (as an example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).

Unsigned-to-signed integer conversion (if the converted value isn't representable in the target type) is always implementation-defined. I would expect non-2's-complement machines to be more likely to differ from the common definition, though technically it could also become an issue on 2's complement implementations.

From C11 (n1570) 6.2.6.2:

(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.

(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M≤N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:

  • the corresponding value with sign bit 0 is negated (sign and magnitude);
  • the sign bit has the value −(2^M) (two's complement);
  • the sign bit has the value −(2^M − 1) (ones' complement).

Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.

mafso
  • 5,433
  • 2
  • 19
  • 40
Upvoted for identifying that the standard limits the representations. I would be happy to be contradicted, but I believe all modern platforms use two's complement. Certainly all the hardware implementing the other formats in this article http://en.wikipedia.org/wiki/Signed_number_representations is literally museum pieces. There are differences in how the bit-manipulation operators, and particularly >>, behave, but the only place you'll find a computer not using two's complement is granddad's basement. – Persixty Mar 09 '15 at 12:54
  • 1
    @DanAllen, I'm not aware of any nor have I ever seen someone being aware of one here (the question is raised from time to time in comments). I'm quite sure there isn't any newer machine. – mafso Mar 09 '15 at 13:06
  • This answer is wrong. The `INT_MAX` masking in the `print_bits_u` call masks off the sign bit, and the initializer for `sign_bit` may have value bits set which are padding bits when interpreted as an `int` (thus also need some masking). I'll fix that if I have found a proper mask. – mafso Mar 09 '15 at 21:25
  • I begin to believe this mask doesn't exist for portable code. An implementation could have `int` with 30 value-bits and 1 sign-bit, and 1 padding bit which is set if and only if the sign-bit is set (if exactly one of the sign and padding bit is set, it is a trap representation). If the padding bit is a value bit when read as an `unsigned int`, there's no way to determine which bit is which, I think... – mafso Mar 09 '15 at 21:52
  • The `TYPE_PUN` macro is incorrect. See http://stackoverflow.com/questions/11442708/type-punning-and-unions-in-c – Cuadue Mar 09 '15 at 23:20
  • 1
    @Cuadue: The name is probably not that well-chosen, but the usage is correct. To type-pun between corresponding signed and unsigned types, the code shown is defined. For type-punning between, say, `float` and `int` (even if their size is the same), this would break strict aliasing. – mafso Mar 09 '15 at 23:24
  • @mafso Yes, I see that now. Besides, it's incidental to the answer! Sorry for being so pedantic :) – Cuadue Mar 10 '15 at 00:10
  • @Cuadue: You're welcome. Being pedantic is good for code intended to be language-lawyer safe :) And I do think the code is broken for machines where signed padding bits are accessible via the unsigned counterpart (see my comments above), it's just not the type-punning itself which is problematic. – mafso Mar 10 '15 at 09:42
@mafso C++23 will limit integer representations to 2's complement. You might want to update your answer based on that. Also, I never realized this before, but the standard section you quoted says the bits have to be powers of 2, but does not specify ordering. So I am not sure if your 2nd paragraph holds. In your example, the representation of 5 needs two bits to be 1, but it is not necessarily 101 (it could be 110). – tinkerbeast Jul 04 '22 at 17:22
6

To add to mafso's excellent answer, there's a part of the ANSI C rationale which talks about this:

The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:

  • Bit-fields are specified by a number of bits, with no mention of “invalid integer” representation. The only reasonable encoding for such bit-fields is binary.
  • The integer formats for printf suggest no provision for “invalid integer” values, implying that any result of bitwise manipulation produces an integer result which can be printed by printf.
  • All methods of specifying integer constants — decimal, hex, and octal — specify an integer value. No method independent of integers is defined for specifying “bit-string constants.” Only a binary encoding provides a complete one-to-one mapping between bit strings and integer values.

The restriction to binary numeration systems rules out such curiosities as Gray code and makes possible arithmetic definitions of the bitwise operators on unsigned types.

The relevant part of the standard might be this quote:

3.1.2.5 Types

[...]

The type char, the signed and unsigned integer types, and the enumerated types are collectively called integral types. The representations of integral types shall define values by use of a pure binary numeration system.

benj
  • 713
  • 6
  • 12
1

If you want to get the bit pattern of a given int, then bitwise operators are your friends. If you want to convert an int to its two's-complement representation, then arithmetic operators are your friends. The two representations can be different, as it is implementation-defined:

Std Draft 2011. 6.5/4. Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) are required to have operands that have integer type. These operators yield values that depend on the internal representations of integers, and have implementation-defined and undefined aspects for signed types.

So it means that i<<1 will effectively shift the bit pattern by one position to the left, but the value produced can be different from i*2 (even for small values of i).

Jean-Baptiste Yunès
  • 34,548
  • 4
  • 48
  • 69
  • If **i*2** (the arithmetic result) is non-negative and in range, `i<<1` and `i*2` are equal. – mafso Mar 09 '15 at 12:20
  • `These operators yield values that depend on the internal representations of integer`. You could have Gray code for integers and Gray<<2 is not Gray*2. – Jean-Baptiste Yunès Mar 09 '15 at 13:55
  • 1
    No. Undefined aspects doesn't mean everything is undefined. The shift operators are defined in terms of values, not bit patterns. And it's not the intent to allow Gray code (in fact, it's the intent to disallow it, from the C99 Rationale V5.10 6.2.6.2: _The restriction to binary numeration systems rules out such curiosities as Gray code and makes possible arithmetic definitions of the bitwise operators on unsigned types._), so a machine using Gray code internally must behave as if it used one of the three allowed encodings. – mafso Mar 09 '15 at 14:29
Right! I've made a serious mistake. 6.5.7/4: The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. – Jean-Baptiste Yunès Mar 09 '15 at 19:05
  • Yes, it's defined in terms of values _and_ in representation (though the standard doesn't say what "left-shifted" means). But the representation is also defined in terms of values of all defined results of `<<`, so it doesn't matter. `n< – mafso Mar 09 '15 at 19:26