4

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.

First:

union int_char {
    int val;
    unsigned char c[sizeof(int)];
} data;

data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", (void *)&data.val, (void *)&data.c[0], (void *)&data.c[sizeof(int)-1]);

// Print the bytes in the order they are stored in memory.
for(size_t i = 0; i < sizeof(int); i++)
    printf("%.2x", data.c[i]);
printf("\n");

Second:

for(int i = 0; i < 8*sizeof(int); i++) {
    int j = 8 * sizeof(int) - 1 - i;
    printf("%d", (val >> j) & 1);
}
printf("\n");

For the second approach, the outputs are 00000002 and 02000000. I also tried other numbers, and the bytes seem to be swapped between the two outputs. Which one is correct?

Student
  • 1
    They're both correct, it depends what order you want to show the bits in – M.M Jul 25 '18 at 11:04
  • 3
    You mean the second code prints `02000000`? `printf("%d", (val >> j) & 1);` should not print anything than `0` or `1`. It should never print a `2` – Gerhardh Jul 25 '18 at 11:10
  • The second will not output a `2`. The two samples explicitly print out the BYTES in opposite order. The second is less preferable, since it is possible for `>>` on an `int` to give undefined behaviour, and a `char` type is not guaranteed to be 8 bits. – Peter Jul 25 '18 at 11:11
  • Your first snippet shows how the bytes are stored in memory. The second snippet prints the bits. The first depends on endianness, the second does not. – Gerhardh Jul 25 '18 at 11:12
  • What type is `val`? How is it assigned? `printf("%d", (val >> j) & 1);` is a problem unless `val` is `unsigned`. A [mcve] would improve this post. – chux - Reinstate Monica Jul 25 '18 at 11:26
  • The undefined behaviour of ((int)a)>>n does not affect (((int)a)>>n) & 1 for small n. n < (sizeof(int) * 8) – William J Bagshaw Jul 25 '18 at 11:43
  • Just to settle an argument below, Can you tell us what hardware platform you're using? – Persixty Jul 25 '18 at 13:33

4 Answers

3

Welcome to the exotic world of endian-ness.

Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.

The electrical engineers who build computers are more imaginative.

Sometimes they store the most significant byte first, but on your platform it's the least significant.

There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.

So we talk about big-endian and little-endian for the most part. The names are a joke from Gulliver's Travels, where there's a pointless war about which end of a boiled egg to start at, which is itself a satire of some disputes in the Christian Church. But I digress.

Because your first snippet looks at the value as a series of bytes, it encounters them in endian order, that is, in the order they are stored in memory.

But because >> is defined in terms of the value rather than its storage, it works 'logically' regardless of how the bytes are laid out in memory.

It's right of C to not define the byte order because hardware not supporting the model C chose would be burdened with an overhead of shuffling bytes around endlessly and pointlessly.

There sadly isn't a built-in identifier telling you which model is in use, though code that detects it can be found.

It will become relevant to you if (a) as above you want to break down integer types into bytes and manipulate them or (b) you receive files from other platforms containing multi-byte structures.

Unicode offers something called a BOM (Byte Order Mark) for UTF-16 and UTF-32. In fact, a good reason (among many) for using UTF-8 is that the problem goes away, because each code unit is a single byte.

Footnote: It's been pointed out quite fairly in the comments that I haven't told the whole story. The C language specification admits more than one representation of integers and particularly signed integers. Specifically signed-magnitude, twos-complement and ones-complement.

It also permits 'padding bits' that don't represent part of the value.

So in principle along with tackling endian-ness we need to consider representation.

In principle. In practice, all modern computers use two's complement, and extant machines that use anything else are very rare. Unless you have a genuine requirement to support such platforms, I recommend assuming you're on a two's-complement system.

Persixty
  • 2
    "we right numbers most significant digit first" --> How about "sixteen"? Looks like least significant digit first? ;-) – chux - Reinstate Monica Jul 25 '18 at 11:30
  • Endianness isn't all there is to this question, representations are widely implementation-defined (possible padding bits, position of the sign bit, ...), so it's really about telling which code prints the actual representation. –  Jul 25 '18 at 11:36
  • @chux Is six a digit? There is a digit called six. I've confused myself now. – Persixty Jul 25 '18 at 12:16
  • @FelixPalmen Platforms that don't use all the bits or don't use twos-complement are now pretty obscure. I think answers should be at the level of the question. Too much information isn't always better. – Persixty Jul 25 '18 at 12:19
  • @Persixty unfortunately, to decide about the "correctness" of such code, this knowledge is needed. –  Jul 25 '18 at 13:19
  • 1
    @FelixPalmen "Widely", eh? I will bet you $100 that the OP is using a conventional 2's complement little-endian machine today, and that he will never encounter anything else during his entire programming career. – Steve Summit Jul 25 '18 at 13:24
  • @FelixPalmen Agreed by strict reading of the standard and a review of all known implementations. But as I said there is such a thing as relevance. – Persixty Jul 25 '18 at 13:28
  • @SteveSummit Unless he's found something in his grandad's basement. There are PDP-11s running nuclear reactors in Canada. But they're 2s-complement! Oh, and it scares the crap out of me if something is working on a nuclear reactor and posting questions on Stack Overflow... – Persixty Jul 25 '18 at 13:30
  • @SteveSummit 'widely' referred to the amount of freedom the implementation has as opposed to the amount of rules it actually has to follow -- not to the numbers of existing systems. –  Jul 25 '18 at 13:30
  • And all this reasoning is void anyways. As long as the standard doesn't mandate anything, it's **never** safe to rely on it. –  Jul 25 '18 at 13:31
  • @FelixPalmen "Wide latitude": fair enough. – Steve Summit Jul 25 '18 at 14:13
  • @FelixPalmen I've thought about your point. I now agree it needs to be mentioned, although I'll still fall off my chair if it turns out to be valuable to talk about the full scope of what the C Language Specification admits. I've added a Footnote to the answer, though still advising that in practice this is something the OP is unlikely to meet. To be honest, unlikely but lucky! I'd love to program a computer using signed magnitude just for novelty. – Persixty Jul 26 '18 at 10:21
  • @Persixty thanks, IMHO this really improves the answer. I thought it's necessary to tell "the whole story" because the question is asking about correctness of code. *padding bits* are the next issue, btw :) Never seen any machine using padding in `int` or anything other than 2's complement, sure, but as long as C allows it, you know .... –  Jul 26 '18 at 10:32
  • @FelixPalmen I meant to mention padding bits! Thanks again. – Persixty Jul 26 '18 at 10:40
  • @FelixPalmen My previous resistance was that we are almost certainly wasting the OPs time if we leave them thinking they might actually encounter these arcane beasts!. – Persixty Jul 26 '18 at 10:43
0

It depends on your definition of "correct".

The first one will print the data exactly like it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done more simply by aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers; in fact, accessing representations is a use case for char pointers mentioned in the standard):

int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (size_t i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);

The second one will print only the value bits **) and take them in the order from the most significant byte to the least significant one. I wouldn't call it correct because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.

edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.


*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about endianness on Wikipedia.

**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.

***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits; even CHAR_BIT * sizeof(int) is wrong, because this would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look, for example, like this:

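#include <stdio.h>

/* Number of value bits in a type whose maximum value is m (m must be 2^n - 1). */
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
                  + (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))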

int main(void)
{
    int x = 2;

    for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
    {
        putchar((unsigned) x & mask ? '1' : '0');
    }
    puts("");
}

See this answer for an explanation of this strange macro.

0

The correct hex representation as a string is 00000002, the same as when you declare the integer with a hex literal:

int n = 0x00000002; //n=2

or as you would get when printing the integer as hex, as in:

printf("%08x", n);

But when printing an integer's bytes one after the other, you must also consider the endianness, which is the byte order in multi-byte integers:

On a big-endian system (some UNIX systems use it), the 4 bytes will be ordered in memory as:

 00 00 00 02

While on a little-endian system (most systems), the bytes will be ordered in memory as:

 02 00 00 00
SHR
  • "*The correct one is 00000002*" <- really? I'd argue `0x00 0x00 0x06` could very well be a "correct" representation for `2`. –  Jul 25 '18 at 11:17
  • the question is about hex representation – SHR Jul 25 '18 at 11:17
  • 1
    What's "hex representation"? The question is about **representation**, which is the actual bit pattern **in memory**. –  Jul 25 '18 at 11:18
  • as I answered, it can be different, little endian will be the opposite of big endian – SHR Jul 25 '18 at 11:20
  • See my example in the first comment. I made up an `int` with 22 value bits and 2 padding bits. And now? –  Jul 25 '18 at 11:20
0

The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endianness will print different results, as they store integers in different ways.

The second prints the bits that make up the integer value most significant bit first. This result is independent of endian. The result is also independent of how the >> operator is implemented for signed ints as it does not look at the bits that may be influenced by the implementation.

The second is a better match to the question "Printing actual bit representation of integers in C". Although there is a lot of ambiguity.

  • 1
    "*The result is also independent of how the >> operator is implemented for signed ints*" <- are you sure? Probably correct in practice, but all the standard (in 6.5.7 p5) has to say: "*If `E1` has a signed type and a negative value, the resulting value is implementation-defined.*" –  Jul 25 '18 at 12:11