Endianness conversion without relying on undefined behavior

Question

I am using C to read a .png image file. If you're not familiar with the PNG format, its integer fields are stored in the file as 4-byte big-endian integers.

My computer is a little-endian machine, so to convert from a big-endian uint32_t that I read from the file with fread() to a little-endian one my computer understands, I've been using this little function I wrote:

#include <stdint.h>

uint32_t convertEndian(uint32_t val){
  union{
    uint32_t value;
    char bytes[sizeof(uint32_t)];
  }in,out;
  in.value=val;
  for(int i=0;i<sizeof(uint32_t);++i)
    out.bytes[i]=in.bytes[sizeof(uint32_t)-1-i];
  return out.value;
}

This works beautifully in my x86_64 UNIX environment, and gcc compiles it without error or warning even with the -Wall flag. Still, I feel rather confident that I'm relying on undefined behavior and type-punning that may not work as well on other systems.

Is there a standard function I can call that can reliably convert a big-endian integer to one the native machine understands, or if not, is there an alternative safer way to do this conversion?

You can use good ol' shifts for unsigned types. Not sure about signed ones, but it certainly can't be impossible. — Oppen, May 21 '20 at 20:26
htonl() and ntohl() rely on the `arpa/inet.h` file which is not available on non-UNIX systems — Willis Hershey, May 21 '20 at 20:33
@FredLarson I think it is making an unnecessary assumption on the endianness of the "network" — Eugene Sh., May 21 '20 at 20:33
Use `uint8_t bytes` instead of `char bytes`. On rare machines where `char` is not 8 bits, code will not compile rather than compile and perform incorrectly. — chux - Reinstate Monica, May 21 '20 at 20:34
Does this answer your question? [convert big endian to little endian in C \[without using provided func\]](https://stackoverflow.com/questions/2182002/convert-big-endian-to-little-endian-in-c-without-using-provided-func) — Fred Larson, May 21 '20 at 20:35
@EugeneSh. not really. The naming of the functions is bad, but the specification is clear about it: "network" == "big" for those. And the program itself, OP says its checking the byte order provided by the PNG file first. — Oppen, May 21 '20 at 20:36
Note that `convertEndian()` is doing an endian swap and not "convert a big-endian integer to one the native machine understands". I'd expect a `big_to_host32()` would be a better approach. — chux - Reinstate Monica, May 21 '20 at 20:40
From a naming POV, `ntohl()` implies network-to-long, yet "network" implies "big" even if some attached network protocol used "little" endian and "long" implies 32-bit, even if `long` is 64-bit. I like ones like `be32toh()` better. — chux - Reinstate Monica, May 21 '20 at 21:07
Your code doesn't rely on UB, but it does rely on your machine being little-endian — M.M, May 21 '20 at 21:34
unfortunately there's no compile-time endian detection in Standard C, but GCC does provide predefined macros — M.M, May 21 '20 at 21:36

chux - Reinstate Monica · Accepted Answer · 2020-05-21T22:06:06.523

3

I see no real UB in OP's code.

Portability issues: yes.

"type-punning that may not work as well on other systems" is not a problem with OP's C code yet may cause trouble with other languages.

Yet how about a big-endian (PNG) to host conversion instead of an unconditional swap?

Extract the bytes by address (lowest address which has the MSByte to highest address which has the LSByte - "big" endian) and form the result with the shifted bytes.

Something like:

#include <stdint.h>

uint32_t Endian_BigToHost32(uint32_t val) {
  union {
    uint32_t u32;
    uint8_t u8[sizeof(uint32_t)]; // uint8_t ensures a byte is 8 bits.
  } x = { .u32 = val };
  // u8[0] is the lowest address, which holds the MSByte of a big-endian value.
  return
      ((uint32_t)x.u8[0] << 24) |
      ((uint32_t)x.u8[1] << 16) |
      ((uint32_t)x.u8[2] <<  8) |
                 x.u8[3];
}

Tip: many libraries have an implementation-specific function to do this efficiently. Example: be32toh.
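As a rough sketch of the tip above: be32toh() is an extension declared in <endian.h> on Linux C libraries, not part of ISO C, so the snippet below guards its use and falls back to the shift-based form elsewhere. The wrapper name png_be32_to_host and the HAVE_BE32TOH macro are made up for illustration, and the preprocessor guard is an assumption about when the header is usable.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: <endian.h> declares be32toh() on Linux C libraries when
   strict ISO mode is off. It is an extension, not ISO C. */
#if defined(__linux__) && !defined(__STRICT_ANSI__)
#  include <endian.h>
#  define HAVE_BE32TOH 1   /* hypothetical feature macro */
#endif

/* Hypothetical wrapper: raw holds four bytes exactly as fread() stored them. */
uint32_t png_be32_to_host(uint32_t raw)
{
#ifdef HAVE_BE32TOH
    return be32toh(raw);
#else
    /* Portable fallback: read the bytes by address, MSByte first. */
    unsigned char b[4];
    memcpy(b, &raw, sizeof b);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] <<  8) |  (uint32_t)b[3];
#endif
}
```

Either branch gives the same result on little-endian and big-endian hosts, so callers do not need to know which one was compiled in.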

edited May 21 '20 at 22:06

answered May 21 '20 at 20:52

chux - Reinstate Monica


This function does not correctly reverse the endianness of an input, and even if it did, it still accesses an unset member of a union, which is the same potentially hazardous behavior as the code in the original question – Willis Hershey May 24 '20 at 00:12
@WillisHershey "it still accesses an unset member of a union" --> is incorrect. What unset member is that? Did you forget `= { .u32 = val };`? This does reverse the endian-ness if "My computer is a little-endian machine" was true. – chux - Reinstate Monica May 24 '20 at 02:01
I was mistaken, your function does work. The situation I was trying to avoid was using unions to pretend a 4-byte integer was 4 1-byte integers. The only improvements here are replacing char with `uint8_t` and removing the loop, which are steps in the right direction, but don't completely solve the problem – Willis Hershey May 24 '20 at 02:57
@WillisHershey There is no pretending, just C specification and [.png](https://en.wikipedia.org/wiki/Portable_Network_Graphics#"Chunks"_within_the_file) compliant code and no UB. .png files have "4-byte big-endian integers" in a particular endian: big. This code converts that faithfully to the local `uint32_t`. What part of the problem do you see as not completely solved? – chux - Reinstate Monica May 24 '20 at 03:13

score 2 · Answer 2 · answered May 21 '20 at 22:23

IMO it'd be better style to read from bytes into the desired format, rather than apparently memcpy'ing a uint32_t and then internally manipulating the uint32_t. The code might look like:

#include <stdint.h>

uint32_t read_be32(const uint8_t *src)   // must be unsigned input
{
     // Multiplying by an unsigned constant keeps every term unsigned.
     return (src[0] * 0x1000000u) + (src[1] * 0x10000u) + (src[2] * 0x100u) + src[3];
}

It's quite easy to get this sort of code wrong, so make sure you get it from high-rep SO users. You may often see the alternative suggestion `return (src[0] << 24) + (src[1] << 16) + (src[2] << 8) + src[3];` however, that causes undefined behaviour if src[0] >= 128, because the integer promotions unfortunately take uint8_t to signed int and the shift then overflows a signed integer. It also causes undefined behaviour on a system with 16-bit int, because shift counts of 16 and 24 are at least the width of int.
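If you prefer the shift spelling, a minimal sketch that avoids both problems is to cast each byte to uint32_t before shifting, so no operand is ever a signed int. The name read_be32_shift is made up here to keep it distinct from read_be32.

```c
#include <stdint.h>

/* Hypothetical name; same contract as read_be32. */
uint32_t read_be32_shift(const uint8_t *src)
{
    /* Each cast happens before the shift, so the shifted operand is a
       32-bit unsigned value and shifts of up to 24 bits are well defined. */
    return ((uint32_t)src[0] << 24) |
           ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] <<  8) |
            (uint32_t)src[3];
}
```

The result should be the same as the multiplication form for every input, including bytes with the top bit set.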

Modern compilers should be smart enough to optimize this; e.g. the assembly produced by clang for a little-endian x86-64 target is:

read_be32:                              # @read_be32
    mov     eax, dword ptr [rdi]
    bswap   eax
    ret

However, I see that gcc 10.1 produces much more complicated code; this seems to be a surprising missed optimization.

score 0 · Answer 3 · answered May 24 '20 at 00:30

This solution doesn't rely on accessing inactive members of a union. It relies instead on unsigned integer bit-shift operations, which portably and safely swap the byte order of a 32-bit value, converting big-endian to little-endian or vice versa.

#include <stdint.h>

uint32_t convertEndian32(uint32_t in){
  return ((in&0xffu)<<24)|((in&0xff00u)<<8)|((in&0xff0000u)>>8)|((in&0xff000000u)>>24);
}
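Since convertEndian32 always swaps, it gives the host value of a big-endian field only on a little-endian machine, as the comments on the question point out. One possible way to make it host-independent is a runtime check, sketched below. The names host_is_little_endian and be32_to_host are invented for illustration, and the check assumes the host is either plain little-endian or plain big-endian.

```c
#include <stdint.h>
#include <string.h>

/* Restated from the answer so this block compiles on its own. */
uint32_t convertEndian32(uint32_t in){
  return ((in&0xffu)<<24)|((in&0xff00u)<<8)|((in&0xff0000u)>>8)|((in&0xff000000u)>>24);
}

/* Hypothetical helper: 1 if the lowest-addressed byte of the value 1 is 1. */
static int host_is_little_endian(void){
  const uint32_t probe = 1u;
  unsigned char first;
  memcpy(&first, &probe, 1);
  return first == 1;
}

/* Hypothetical wrapper: raw holds the four bytes as read from the file. */
uint32_t be32_to_host(uint32_t raw){
  return host_is_little_endian() ? convertEndian32(raw) : raw;
}
```

On a big-endian host the bytes are already in the right order, so the value is returned unchanged.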

score 0 · Answer 4 · answered May 24 '20 at 00:37

This code reads a uint32_t through a pointer to uchar_t from big-endian storage, independently of the endianness of your architecture. (The code just acts as if it were reading a base-256 number.)

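#include <stdint.h>

typedef unsigned char uchar_t; /* not a standard type; assumed to alias unsigned char */

uint32_t read_bigend_int(uchar_t *p, int sz)
{
    uint32_t result = 0;
    while(sz--) {
        result <<= 8;   /* multiply by base */
        result |= *p++; /* and add the next digit */
    }
    return result;
}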

For example, you might call it like this:

int main()
{
    /* ... */
    uchar_t buff[1024];
    read(fd, buff, sizeof buff);

    uint32_t value = read_bigend_int(buff + offset, sizeof value);
    /* ... */
}
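To tie this back to PNG, here is a small sketch that pulls the image width and height out of the start of a PNG stream already held in memory. It relies on the PNG layout of an 8-byte signature followed by the IHDR chunk (4-byte length, 4-byte type, then 4-byte width and 4-byte height). The names read_be and png_read_dimensions are made up for this example, and it deliberately skips validating the signature and chunk type.

```c
#include <stddef.h>
#include <stdint.h>

/* Same base-256 loop as read_bigend_int, restated so this compiles alone. */
static uint32_t read_be(const unsigned char *p, size_t sz)
{
    uint32_t result = 0;
    while (sz--) {
        result <<= 8;   /* multiply by base */
        result |= *p++; /* and add the next digit */
    }
    return result;
}

/* Hypothetical helper: returns 0 on success, -1 if the buffer is too short.
   No signature or chunk-type validation is done in this sketch. */
int png_read_dimensions(const unsigned char *buf, size_t len,
                        uint32_t *width, uint32_t *height)
{
    /* 8 (signature) + 4 (length) + 4 (type) + 4 (width) + 4 (height) */
    if (len < 24)
        return -1;
    *width  = read_be(buf + 16, 4);
    *height = read_be(buf + 20, 4);
    return 0;
}
```

Because the bytes are consumed one at a time from the buffer, no uint32_t is ever read in file byte order, so host endianness never enters the picture.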
