Changing endianness, is union more efficient than bitshifts?

Question

I was asked, as a challenge, to change the endianness of an int. My idea was to use bitshifts:

int    swap_endianess(int color)
{
    int a;
    int r;
    int g;
    int b;

    a = (color & (255 << 24)) >> 24;
    r = (color & (255 << 16)) >> 16;
    g = (color & (255 << 8)) >> 8;
    b = (color & 255);
    return (b << 24 | g << 16 | r << 8 | a);
}

But someone told me that it would be easier to use a union containing an int and an array of four chars (assuming an int is stored in 4 chars): fill the int, then reverse the array.

union   u_color
{
  int   color;
  char  c[4];
};

int             swap_endianess(int color)
{
  union u_color ucol;
  char          tmp;

  ucol.color = color;
  tmp = ucol.c[0];
  ucol.c[0] = ucol.c[3];
  ucol.c[3] = tmp;
  tmp = ucol.c[1];
  ucol.c[1] = ucol.c[2];
  ucol.c[2] = tmp;
  return (ucol.color);
}

Which of these two ways of swapping bytes is more efficient? Are there more efficient ways of doing this?

EDIT

After testing on an i7, the union way takes about 24 seconds (measured with the time command), while the bitshift way takes about 15 seconds for 2,000,000,000 iterations. However, if I compile with -O1, both methods take only 1 second, and 0.001 second with -O2 or -O3.

The bitshift method compiles to bswap in ASM with -O2 and -O3, but the union way does not: gcc seems to recognize the naive pattern but not the more complicated union version. To conclude, read the bottom line of @user3386109's answer.

The union exploits undefined behavior. You might get problems this way. — fuz, Oct 09 '14 at 16:25
Your bit shift code has several bugs. You need code that works correctly, before you start worrying about efficiency. — user3386109, Oct 09 '14 at 16:26
There are fast ways to parallelize these sorts of bit shifting operations by multiplying by magic numbers. Take a look at http://stackoverflow.com/a/15185392/3897316 for an example. — Degustaf, Oct 09 '14 at 16:29
@2501 the second of his answers. Writing to one member of a union and then reading from another is undefined behavior, except under certain circumstances where it becomes implementation-defined behavior. — fuz, Oct 09 '14 at 16:30
The idea here is to "isolate" the leftmost byte of color: I first set all other bits to zero and then shift the resulting byte to the right. It seems to work for me. @user3386109, what bugs? — Brendan Rius, Oct 09 '14 at 16:36
@FUZxxl Maybe in older versions, but from C11 (or even C99) onward, accessing a different member through a union is allowed; it is called type-punning. http://stackoverflow.com/questions/25664848/unions-and-type-punning — 2501, Oct 09 '14 at 16:39
@alk Because `ntohl` and `htonl` don't swap endianness. They are no-ops on a big endian machine. — David Heffernan, Oct 09 '14 at 16:41
After having tested, the bitshift way takes ~15 seconds on 2,000,000,000 iterations, while the union way takes ~24 seconds. — Brendan Rius, Oct 09 '14 at 16:42
Try your function with an input of 0x11223344, and see what you get. — user3386109, Oct 09 '14 at 16:46

score 3 · Accepted Answer · answered Oct 09 '14 at 18:09

Here is the correct code for a byte swap function

#include <stdint.h>

uint32_t changeEndianess( uint32_t value )
{
    uint32_t r, g, b, a;

    r = (value >> 24) & 0xff;
    g = (value >> 16) & 0xff;
    b = (value >>  8) & 0xff;
    a =  value        & 0xff;

    return (a << 24) | (b << 16) | (g << 8) | r;
}

Here's a function that tests the byte swap function

#include <stdio.h>
#include <stdlib.h>   /* arc4random, on BSD and macOS */

void testEndianess( void )
{
    uint32_t value = arc4random();
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}

Using the LLVM compiler with full optimization, the resulting assembly code for the testEndianess function is

0x93d0:  calll  0xc82e                    ; call `arc4random`
0x93d5:  movl   %eax, %ecx                ; copy `value` into register ECX
0x93d7:  bswapl %ecx                      ; <--- this is the `changeEndianess` function
0x93d9:  movl   %ecx, 0x8(%esp)           ; put 'result' on the stack
0x93dd:  movl   %eax, 0x4(%esp)           ; put 'value' on the stack
0x93e1:  leal   0x6536(%esi), %eax        ; compute address of the format string
0x93e7:  movl   %eax, (%esp)              ; put the format string on the stack
0x93ea:  calll  0xc864                    ; call 'printf'

In other words, the LLVM compiler recognizes the entire changeEndianess function and implements it as a single bswapl instruction.

Side note for those wondering why the call to arc4random is necessary. Given this code

void testEndianess( void )
{
    uint32_t value = 0x11223344;
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}

the compiler generates this assembly

0x93dc:  leal   0x6524(%eax), %eax        ; compute address of format string 
0x93e2:  movl   %eax, (%esp)              ; put the format string on the stack
0x93e5:  movl   $0x44332211, 0x8(%esp)    ; put 'result' on the stack
0x93ed:  movl   $0x11223344, 0x4(%esp)    ; put 'value' on the stack
0x93f5:  calll  0xc868                    ; call 'printf'

In other words, given a hardcoded value as input, the compiler precomputes the result of the changeEndianess function, and puts that directly into the assembly code, bypassing the function entirely.
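The 0x11223344 / 0x44332211 pair from the listing above also makes a handy self-check. The following is only a minimal sketch: it repeats changeEndianess unchanged so that it compiles on its own, and the helper name checkEndianess is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte swap from the answer above, repeated so the sketch is self-contained. */
uint32_t changeEndianess( uint32_t value )
{
    uint32_t r, g, b, a;

    r = (value >> 24) & 0xff;
    g = (value >> 16) & 0xff;
    b = (value >>  8) & 0xff;
    a =  value        & 0xff;

    return (a << 24) | (b << 16) | (g << 8) | r;
}

/* Hypothetical helper: aborts via assert if the swap misbehaves. */
void checkEndianess( void )
{
    /* The pair that the compiler precomputed in the listing above. */
    assert( changeEndianess( 0x11223344u ) == 0x44332211u );

    /* A value with the top bit set, where signed shifts tend to go wrong. */
    assert( changeEndianess( 0x81223344u ) == 0x44332281u );

    /* Swapping twice must give back the original value. */
    assert( changeEndianess( changeEndianess( 0xdeadbeefu ) ) == 0xdeadbeefu );
}
```

Calling checkEndianess() once at program start is enough; with assertions enabled it stops the program on the first failing case.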

The bottom line: write your code in the way that makes sense, and let the compiler do the optimizing. Compilers these days are amazing. Using tricky optimizations in source code (e.g. unions) may defeat the optimizations built into the compiler, actually resulting in slower code.

fuz · Answer 2 · 2014-10-09T20:20:51.377

2

You can also use this code which might be slightly more efficient:

#include <stdint.h>

extern uint32_t
change_endianness(uint32_t x)
{
    x = (x & 0x0000FFFFLU) << 16 | (x & 0xFFFF0000LU) >> 16;
    x = (x & 0x00FF00FFLU) <<  8 | (x & 0xFF00FF00LU) >>  8;
    return (x);
}

This is compiled by gcc on amd64 to the following assembly:

change_endianness:
    roll $16, %edi
    movl %edi, %eax
    andl $16711935, %edi
    andl $-16711936, %eax
    salq $8, %rdi
    sarq $8, %rax
    orl  %edi, %eax
    ret
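The two masked shifts in change_endianness can be read as two separate steps: the first exchanges the 16-bit halves, the second exchanges the bytes inside each half. The sketch below splits them apart to make that visible. The function names swap_halves, swap_bytes_in_halves, change_endianness_steps and check_trace are made up for illustration, and UINT32_C is used in place of the LU suffixes only to keep the arithmetic in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: exchange the upper and lower 16-bit halves. */
uint32_t swap_halves(uint32_t x)
{
    return (x & UINT32_C(0x0000FFFF)) << 16 | (x & UINT32_C(0xFFFF0000)) >> 16;
}

/* Step 2: exchange the two bytes inside each 16-bit half. */
uint32_t swap_bytes_in_halves(uint32_t x)
{
    return (x & UINT32_C(0x00FF00FF)) << 8 | (x & UINT32_C(0xFF00FF00)) >> 8;
}

/* Both steps chained, equivalent to the two assignments in change_endianness. */
uint32_t change_endianness_steps(uint32_t x)
{
    return swap_bytes_in_halves(swap_halves(x));
}

/* Hypothetical helper tracing 0x11223344 through both steps. */
void check_trace(void)
{
    /* 0x1122 and 0x3344 trade places. */
    assert(swap_halves(UINT32_C(0x11223344)) == UINT32_C(0x33441122));

    /* Inside each half, the two bytes trade places. */
    assert(swap_bytes_in_halves(UINT32_C(0x33441122)) == UINT32_C(0x44332211));

    /* Together they reverse all four bytes. */
    assert(change_endianness_steps(UINT32_C(0x11223344)) == UINT32_C(0x44332211));
}
```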

To get an even better result, you might want to use inline assembly. The i386 and amd64 architectures provide a bswap instruction that does what you want. As user3386109 explained, compilers might recognize the “naïve” approach and emit a bswap instruction, something that doesn't happen with the approach above. The approach above is, however, the better choice when the compiler is not smart enough to detect that it can use bswap.
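As an alternative to hand-written assembly, GCC and Clang offer the __builtin_bswap32 builtin. Note that this swaps in a compiler extension for the inline assembly suggested above; it is not standard C, and whether it ends up as a single bswap depends on the target and the optimization level. A minimal sketch, where the names swap32, swap32_portable and check_swap32 are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: the plain shift-and-mask byte swap. */
uint32_t swap32_portable(uint32_t x)
{
    return ((x >> 24) & 0xff)
         | ((x >>  8) & 0xff00)
         | ((x <<  8) & 0xff0000)
         | ((x << 24) & 0xff000000u);
}

/* Uses the compiler builtin where available, the fallback elsewhere. */
uint32_t swap32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(x);   /* GCC/Clang extension, not standard C */
#else
    return swap32_portable(x);
#endif
}

/* Hypothetical helper: both paths must agree with each other. */
void check_swap32(void)
{
    assert(swap32(0x11223344u) == 0x44332211u);
    assert(swap32_portable(0x11223344u) == 0x44332211u);
    assert(swap32(0x81223344u) == swap32_portable(0x81223344u));
}
```

Keeping the portable version next to the builtin means the code still builds with other compilers, and the two can be checked against each other.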

edited Oct 09 '14 at 20:20

answered Oct 09 '14 at 16:33

fuz


It takes ~12 seconds for 2,000,000,000 iterations – Brendan Rius Oct 09 '14 at 16:54
@BrendanRius On what system? Can you also publish figures for the others? – fuz Oct 09 '14 at 17:21
+1 for a tight solution. Note: Sometimes, but I do not think in this case, a `u` suffix on a literal constant is needed to prevent `int` vs. unsigned operations. But still suggest `0x0000FFFFUL`. – chux - Reinstate Monica Oct 09 '14 at 18:44
@chux I had the right thought with the `L`, but I forgot about the U. Thanks for the hint. – fuz Oct 09 '14 at 20:21
Are the `& 0x0000FFFFLU` and `& 0xFFFF0000LU` necessary? Don't the affected bits get shifted out anyway? – Kelly Bundy Jan 09 '22 at 01:20
