How to write a compile-time initialisation of a 4 byte character constant that is fully portable

Question

The (legacy) code looks roughly like this.

#define MAKEID(a,b,c,d) (((UInt32)a)<<24 | ((UInt32)b)<<16 
                        | ((UInt32)c)<<8 | ((UInt32)d) )
#define ID_FORM MAKEID('F','O','R','M')
//...
struct S { 
  int id;
  int x;
  double d;
  // other stuff
};
//...
S s;
socket.read(&s, sizeof(s));  // read network data over struct
if (s.id == ID_FORM) { }

The code reads a stream of characters into the struct. The S.id is known to be a 4 character constant such as 'FORM' or 'DATA' in stream (network) order, which determines the layout of the rest of the struct. All comparisons are integer using predefined constants.

The MAKEID macro is big-endian because it places the first (character) argument at the most significant byte, which is also the lowest memory address. A little endian version of the macro would look like this, placing the first (character) argument at the least significant byte, which is now the lowest memory address.

 #define MAKEID(a,b,c,d) (((UInt32)d)<<24 | ((UInt32)c)<<16 
                        | ((UInt32)b)<<8 | ((UInt32)a) )

The question is how to rewrite it so that it works equally well on big-endian and little-endian architectures.

No, I do not want to write both macros and choose which one with an #ifdef. There is no other endian dependency anywhere in the code and I'm not keen to introduce one here. Portable is the way to go.

No, I do not want to write a function. This constant is used in places where a function cannot go. I wrote a portable function that initialised a union, and the code won't compile.

Any kind of portable macro or template to do compile-time initialisation is what I'm looking for.

In answer to a comment, this is the real code. It's part of a network protocol and the other end takes care of the endianism in most cases. This happens to be an exception where the other end generates in network byte order and this end was historically written big-endian as a 4 byte character constant like 'FORM'. I need a point solution and not one that propagates the idea of endianism to elsewhere in the code.

I'm not a C++ expert, but in which places do you want to use this constant where a `constexpr` function isn't possible? — CodesInChaos, Feb 25 '14 at 13:19
That is definitely the most sensible question anyone has asked so far. It's used in enums and string-building macros, which are all evaluated at compile time, so a normal function won't work. Do you want to have a go at writing a constexpr function, so I can give you credit? — david.pfx, Feb 26 '14 at 11:01
If you really have "no other endian dependency anywhere", then I assume your `struct S` example above isn't really representative of your actual legacy code. Because you'd *totally* have endianness (not to mention floating-point format) issues if you just "laid network data directly over the top of this struct" on a system with different endianness. — Ilmari Karonen, Mar 02 '14 at 21:21
I assume changing the `id` field to a `char[4]` is out of the question? Because, based on your edit, that would be the "right" way to do it. — Ilmari Karonen, Mar 02 '14 at 23:03

score 7 · Answer 1 · answered Feb 23 '14 at 08:47

7

Your MAKEID macro is endianness independent. It works identically on both big- and little-endian systems.

The macro may appear to be big-endian specific, but the shifts and bitwise or operations in C++ are all defined in terms of the result they have on the values that are operated on, not the underlying storage of those values.
Doing 42 << 24 is guaranteed to put the value 42 in the most significant 8 bits of the result, regardless of the byte order. Similarly for the bitwise or operation. This means that the result of MAKEID(0x12, 0x34, 0x56, 0x78) is always 0x12345678, regardless of the byte ordering of the underlying storage.

If you want to produce an integer whose underlying storage always has the same bit pattern (for example, 0x12, 0x34, 0x56, 0x78), then you really have to rethink your approach. Such an integer will have the value 0x12345678 on a big-endian system, 0x78563412 on a little-endian system and perhaps 0x56781234 on a middle-endian system.
However, if that bit-pattern was received over a communication interface that was defined with a particular byte order (for example big-endian/network byte order), then you must convert any multi-byte values that you receive into the system's native byte order if you want those values to be interpreted correctly by the receiving system and that includes the four byte ID value.

That is why I said in an earlier version of the answer that if you find that on some systems (in particular those where the system's byte order doesn't match the communication's byte order) the ID read from the stream doesn't match the result of MAKEID, then the likely culprit is the deserialization code. The (de-)serialization code is the most important place to take endianness into account. For example, overlaying the struct you expect over the bytes you received is easy, but is the wrong solution if there can be a byte-order mismatch or a difference in padding.

answered Feb 23 '14 at 08:47

Bart van Ingen Schenau

15,488
4
32
41

Sorry, but you're totally wrong. I've edited the question to show why. – david.pfx Feb 23 '14 at 13:10
you're a clever guy and I'm sure you think you're right, but you're totally missing the point. There is NO serialisation code. The network data is laid directly over the top of this struct, and the MAKEID macro is the ONLY place where the network order is surfaced to the C code. This specific behaviour is what makes it endian-specific. Much appreciated if you could understand what the question is really asking and try to answer it. – david.pfx Feb 25 '14 at 12:59
3

@david.pfx Your serialization is the endian specific code. There is serialization code: The reinterpret cast between bytes and a struct you perform. IMO this kind of serialization is bad form and should be avoided. – CodesInChaos Feb 25 '14 at 13:17
6

@david.pfx: Overlaying the network data and the struct *is* a form of deserialization. It is the most fragile kind and a byte-order difference between the network stream and the receiving system is one of the ways that can break it. – Bart van Ingen Schenau Feb 25 '14 at 13:25
@CodesInChaos: tell me something I didn't already know. If I wanted advice, I'd ask for it. This is legacy code with the restrictions as laid out. Please provide an answer if you have one, or stand aside. – david.pfx Feb 26 '14 at 10:52
@BartvanIngenSchenau: Of course I know that. The question is how to make this port work without rewriting hundreds or thousands of lines of legacy code. The problem is as I stated it, not as you wish it to be. – david.pfx Feb 26 '14 at 10:53
2

@david.pfx: In that case, please update your question with an explanation why you need endian variations for this constant and why there are no problems with other multi-byte fields in your data. – Bart van Ingen Schenau Feb 26 '14 at 11:09
I think the OP's statement is incorrect. `MAKEID` contructs a *value*, so it doesn't relate to endianness, which is how those bytes are written in memory just like you said. What he wants is how to make that value when compared with another value in memory returns a correct result regardless of endianness. – phuclv Jun 19 '14 at 11:27

score 2 · Accepted Answer · edited Jun 23 '22 at 17:40

2

I set this same question to the programmers who work for me. Between us we came up with the following 4 solutions. We shall go with the macro, and perhaps convert the code over to use one of the functions as time permits.

unsigned int MakeId(char a, char b, char c, char d) {
  char x[4] = { a, b, c, d };
  return *(int*)x;
}
unsigned int MakeId(char a, char b, char c, char d) {
  union {
    char x[4];
    int i;
  } u = { a, b, c, d };
  return u.i;
}
unsigned int MakeId(const char* s) { return *(int*)s; }

#define MAKEID(s) *(int*)(s);

#define FORM_ID MAKEID("Form")

On this occasion the formidable minds of Stack Overflow did not deliver.

edited Jun 23 '22 at 17:40

ankostis

8,579
3
47
61

answered Feb 27 '14 at 13:30

david.pfx

10,520
3
30
63

6

Now I see - you needed a 4-byte constant that contains bytes 'F', 'o', 'r' and 'm' in network order. Here's yet another solution: `#define FORM_ID (*(int *) (char[]) {'F', 'o', 'r', 'm'})` It looks less nice than `*(int *) "Form"` but it doesn't have an extra zero byte. – aragaer Feb 27 '14 at 13:54
@aragaer: Nice! That's the kind of contribution I was hoping for. If this compiles in all the places we need, we might have a winner here. Thanks! – david.pfx Feb 27 '14 at 13:59
Unless you make sure that the `x` char array to be 4-byte aligned, the first and third solution most probably won't work on other architectures except for x86 because they don't allow unaligned access – phuclv Jun 19 '14 at 11:32
Yes, this is a problem full of subtleties. The union may well be the safest solution. – david.pfx Jun 19 '14 at 11:57
The last 2 macros work in C but fail to compile in C++, with: `error: taking address of temporary array`. – ankostis Jun 23 '22 at 17:47
@ankostis: It's been many years and the code is long gone, but I think that will depend on the argument to the macro. Inserting a `(void*)` cast might fix it (it often does in picky C++). – david.pfx Jun 25 '22 at 00:14

score 0 · Answer 3 · answered Feb 25 '14 at 13:08

0

Instead of having a constant defined differently on different machines you should process the data you received:

if (ntohl(s.id) == ID_FORM) { }

Edit: To avoid the editing the code, you can use the htonl to initialize the ID_FORM instead:

#define ID_FORM htonl(MAKEID('F','O','R','M'))

This relies on htonl being a macro. And it usually is. And if it is it is usually defined with the same conditional you're trying to avoid: http://www.jbox.dk/sanos/source/include/net/inet.h.html (as an example).

So if on your system htonl is not a macro, the only sane choice I see is to actually stick to #ifdef.

Keep in mind that from now on your ID_FORM is now in "network endianness", not "host endianness".

answered Feb 25 '14 at 13:08

aragaer

17,238
6
47
49

Again, that comparison is throughout a large body of legacy code, and is effectively untouchable. Please answer the question as asked, if you can. – david.pfx Feb 26 '14 at 10:54
I expected a comment like this. See the edited answer. – aragaer Feb 26 '14 at 11:03
OK, so now you understand the question. I already know that solution, and it's exactly what I've done many times before. What I'm looking for is a new idea for a portable solution that does not require an explicit BIG_ENDIAN vs LITTLE_ENDIAN and two separate macros. Do you have any suggestions? – david.pfx Feb 26 '14 at 12:43
The existing compare code is comparing two variables. One of those you got from the network, so it is in network endianness. The other one has to be created locally. Your options are to create a network-endian variable or create a host-endian variable then convert it to network-endian. Both cases require "#ifdef" either explicit or hidden inside `htonl` definition. – aragaer Feb 26 '14 at 13:51
No, they most certainly don't. See answer posted. – david.pfx Feb 27 '14 at 13:31

How to write a compile-time initialisation of a 4 byte character constant that is fully portable

3 Answers3

Linked