Quickest way to change endianness

Question

What is the quickest way to reverse the endianness of a 16-bit and a 32-bit integer? I usually do something like this (written in C++ with Visual Studio):

union bytes4
{
    __int32 value;
    char ch[4];
};

union bytes2
{
    __int16 value;
    char ch[2];
};

__int16 changeEndianness16(__int16 val)
{
    bytes2 temp;
    temp.value=val;

    char x= temp.ch[0];
    temp.ch[0]=temp.ch[1];
    temp.ch[1]=x;
    return temp.value;
}

__int32 changeEndianness32(__int32 val)
{
    bytes4 temp;
    temp.value=val;
    char x;

    x= temp.ch[0];          // swap the outer bytes (0 <-> 3)
    temp.ch[0]=temp.ch[3];
    temp.ch[3]=x;

    x= temp.ch[1];          // swap the inner bytes (1 <-> 2)
    temp.ch[1]=temp.ch[2];
    temp.ch[2]=x;
    return temp.value;
}

Is there any faster way to do the same, in which I don't have to do so many calculations?

See this topic, which mentions using intrin.h: http://stackoverflow.com/questions/105252/how-do-i-convert-between-big-endian-and-little-endian-values-in-c – Sebastiaan M Sep 02 '11 at 05:07

score 8 · Accepted Answer · edited May 11 '18 at 03:14

Why aren't you using the built-in swab function, which is likely optimized better than your code?

Beyond that, the usual bit-shift operations should be fast to begin with, and are so widely used they may be recognized by the optimizer and replaced by even better code.

Because other answers have serious bugs, I'll post a better implementation:

int16_t changeEndianness16(int16_t val)
{
    return (val << 8) |          // left-shift always fills with zeros
          ((val >> 8) & 0x00ff); // right-shift sign-extends, so force to zero
}

None of the compilers I tested generate rolw for this code; I think a slightly longer sequence (in terms of instruction count) is actually faster. Benchmarks would be interesting.

For 32-bit, there are a few possible orders for the operations:

//version 1
int32_t changeEndianness32(int32_t val)
{
    return (val << 24) |
          ((val <<  8) & 0x00ff0000) |
          ((val >>  8) & 0x0000ff00) |
          ((val >> 24) & 0x000000ff);
}

//version 2, one less OR, but has data dependencies
int32_t changeEndianness32(int32_t val)
{
    int32_t tmp = (val << 16) |
                 ((val >> 16) & 0x00ffff);
    return ((tmp >> 8) & 0x00ff00ff) | ((tmp & 0x00ff00ff) << 8);
}
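
As a minimal sketch of the same shift-and-mask idea on unsigned types (the names swapBytes16 and swapBytes32 are hypothetical, not from the answer): with unsigned operands a right shift never sign-extends, so the 16-bit case needs no mask at all and the outermost terms of the 32-bit case need none either. This assumes the fixed-width types from <cstdint> are available.

```cpp
#include <cstdint>

// Hypothetical helper names, for illustration only.
constexpr std::uint16_t swapBytes16(std::uint16_t val)
{
    // val is promoted to a wider integer type before shifting;
    // the cast back to 16 bits discards everything above bit 15.
    return static_cast<std::uint16_t>((val << 8) | (val >> 8));
}

constexpr std::uint32_t swapBytes32(std::uint32_t val)
{
    return (val << 24) |                  // byte 0 -> byte 3
          ((val <<  8) & 0x00ff0000u) |   // byte 1 -> byte 2
          ((val >>  8) & 0x0000ff00u) |   // byte 2 -> byte 1
           (val >> 24);                   // byte 3 -> byte 0, zero-filled
}
```

Callers holding signed values would cast to the unsigned type before the call and back afterwards.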

edited May 11 '18 at 03:14

Nick Gammon

answered Sep 02 '11 at 05:03

Ben Voigt

@James: I typed `swab` because I meant it: http://msdn.microsoft.com/en-us/library/e8cxb8tk – Ben Voigt Sep 02 '11 at 14:30
By casting to an unsigned type you can totally avoid sign extension, so you don't need the bit masks. `(val >> 8) & 0x00ff` becomes `((uint16_t)val) >> 8`. Furthermore, I would place this in a macro or an inline function to avoid the function-call overhead. – Max Truxa Mar 07 '13 at 07:44
@yourmt: Of course you want to have these inlined, but throwing in extra keywords doesn't make the answer clearer. Also, you can avoid that one bitmask, but not the others. I think consistency is clearer (and the compiler should do the same thing anyway). – Ben Voigt Mar 07 '13 at 16:44
1. swab doesn't solve the 32-bit case alone, although it handles the 16-bit case correctly 2. by design one expects swab to be slower because it handles the generic case of swapping adjacent bytes, not some fixed number of bytes – jheriko Sep 12 '16 at 15:50
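
For reference, a rough sketch of what calling swab looks like, assuming POSIX swab from <unistd.h> or Microsoft's _swab from <stdlib.h>. The wrapper names swapAdjacentBytes and swab16 are hypothetical. It also illustrates the point in the last comment: swab swaps adjacent byte pairs, so on four bytes it yields 2,1,4,3 rather than a full 32-bit reversal.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>   // _swab
#else
#include <unistd.h>   // POSIX swab
#endif

// Hypothetical wrapper: copies n bytes from src to dst, swapping each adjacent pair.
// src and dst must not overlap.
inline void swapAdjacentBytes(const unsigned char* src, unsigned char* dst, int n)
{
#if defined(_MSC_VER)
    _swab(reinterpret_cast<char*>(const_cast<unsigned char*>(src)),
          reinterpret_cast<char*>(dst), n);
#else
    swab(src, dst, n);
#endif
}

// 16-bit byte swap built on the wrapper above.
inline std::uint16_t swab16(std::uint16_t val)
{
    std::uint16_t out = 0;
    swapAdjacentBytes(reinterpret_cast<const unsigned char*>(&val),
                      reinterpret_cast<unsigned char*>(&out),
                      static_cast<int>(sizeof val));
    return out;
}
```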

score 5 · Answer 2 · answered Sep 02 '11 at 19:52

At least in Visual C++, you can use _byteswap_ulong() and friends: http://msdn.microsoft.com/en-us/library/a3140177.aspx

These functions are treated as intrinsics by the VC++ compiler, and will result in generated code that takes advantage of hardware support when available. With VC++ 10.0 SP1, I see the following generated code for x86:

return _byteswap_ulong(val);

mov     eax, DWORD PTR _val$[esp-4]
bswap   eax
ret     0

return _byteswap_ushort(val);

mov     ax, WORD PTR _val$[esp-4]
mov     ch, al
mov     cl, ah
mov     ax, cx
ret     0
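
One way to use these intrinsics outside Visual C++ is a thin wrapper, sketched below. The names byteswap16 and byteswap32 are hypothetical. The sketch assumes GCC or Clang provide __builtin_bswap16 and __builtin_bswap32 (the 16-bit builtin needs a reasonably recent GCC) and falls back to plain shifts elsewhere.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>   // _byteswap_ushort, _byteswap_ulong
#endif

// Hypothetical wrappers: pick the compiler intrinsic when one is known to exist.
inline std::uint16_t byteswap16(std::uint16_t val)
{
#if defined(_MSC_VER)
    return _byteswap_ushort(val);
#elif defined(__GNUC__)            // also defined by Clang
    return __builtin_bswap16(val);
#else
    return static_cast<std::uint16_t>((val << 8) | (val >> 8));
#endif
}

inline std::uint32_t byteswap32(std::uint32_t val)
{
#if defined(_MSC_VER)
    return static_cast<std::uint32_t>(_byteswap_ulong(val));
#elif defined(__GNUC__)
    return __builtin_bswap32(val);
#else
    return (val << 24) | ((val << 8) & 0x00ff0000u) |
           ((val >> 8) & 0x0000ff00u) | (val >> 24);
#endif
}
```

Whether a single bswap instruction is actually emitted depends on the compiler and target, so checking the generated assembly as above is still worthwhile.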

score 2 · Answer 3 · answered Sep 02 '11 at 05:06

Who says it does too many calculations?

out = changeEndianness16(in);

gcc 4.6.0

movzwl  -4(%rsp), %edx
movl    %edx, %eax
movsbl  %dh, %ecx
movb    %cl, %al
movb    %dl, %ah
movw    %ax, -2(%rsp)

clang++ 2.9

movw    -2(%rsp), %ax
rolw    $8, %ax
movw    %ax, -4(%rsp)

Intel C/C++ 11.1

movzwl    4(%rsp), %ecx
rolw      $8, %cx
xorl      %eax, %eax
movw      %cx, 6(%rsp)

What does your compiler produce?

edited Sep 02 '11 at 05:21

answered Sep 02 '11 at 05:06

Cubbi

haven't checked the assembly code... Don't have the tools right now at office... :( – c0da Sep 02 '11 at 05:13
Please note: the `rolw` instruction is slower than might be expected for a single simple instruction. http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01234.html – Ben Voigt Sep 02 '11 at 18:01
@Ben Voigt quite possible, I was mainly responding to the "many calculations" assumption and inviting to look at the actual compiler output before discussing microoptimizations. Nice answer, by the way. – Cubbi Sep 02 '11 at 18:54

score 2 · Answer 4 · answered Sep 02 '11 at 06:19

I used the following code for the 16-bit swap function:

__int16 changeEndianness16(__int16 val)
{
    return ((val & 0x00ff) << 8) | ((val & 0xff00) >> 8);
}

With g++ 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5), compiling the above code with g++ -O3 -S -fomit-frame-pointer test.cpp produces the following (non-inlined) assembly:

movzwl  4(%esp), %eax
rolw    $8, %ax
ret

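The next version looks equivalent, but g++ does not optimize it as well (as the comments below note, it is not truly equivalent: val >> 8 sign-extends negative values).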

__int16 changeEndianness16_2(__int16 val)
{
    return ((val & 0xff) << 8) | (val >> 8);
}

Compiling it gives more asm code:

movzwl  4(%esp), %edx
movl    %edx, %eax
sarl    $8, %eax
sall    $8, %edx
orl     %edx, %eax
ret

answered Sep 02 '11 at 06:19

trenki

You don't get the same code, because it isn't actually equivalent. Sign extension will give wrong results with the second (the first version isn't correct either, whether or not it works depends on the platform, and specifically on `sizeof(int)`). – Ben Voigt Sep 02 '11 at 17:45
@BenVoigt - Why wouldn't the first version of code in this answer work, as you commented? – goldenmean Jul 03 '18 at 12:55
@goldenmean: Say that `val == 0x8000` on a system where `sizeof (int) == sizeof(__int16)`. Now the first term is 0, but `val & 0xff00` is `0x8000`, and `(val & 0xff00) >> 8` is `0xff80`. Now `0xff80` is not the byte swapped version of `0x8000`. – Ben Voigt Jul 03 '18 at 13:45
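
The sign-extension problem with the second version can be reproduced on an ordinary platform. This is a small sketch under some assumptions: std::int16_t stands in for __int16, int is 32 bits wide, and right-shifting a negative int is an arithmetic shift (true of the mainstream compilers, but only guaranteed from C++20). The name changeEndianness16_unsigned is hypothetical.

```cpp
#include <cstdint>

// Second version from the answer above, with std::int16_t in place of __int16.
inline std::int16_t changeEndianness16_2(std::int16_t val)
{
    // val is promoted to int; for negative val, val >> 8 keeps the sign bits,
    // and they overwrite the byte placed by the left-shifted term.
    return static_cast<std::int16_t>(((val & 0xff) << 8) | (val >> 8));
}

// Hypothetical variant: do the swap on the unsigned bit pattern instead.
inline std::int16_t changeEndianness16_unsigned(std::int16_t val)
{
    const std::uint16_t u = static_cast<std::uint16_t>(val);   // modular, well defined
    const std::uint16_t swapped = static_cast<std::uint16_t>((u << 8) | (u >> 8));
    // Converting back is implementation-defined before C++20 when swapped > 0x7fff.
    return static_cast<std::int16_t>(swapped);
}
```

With val = -32767 (bit pattern 0x8001) the first function returns -128 (0xff80), while the byte-swapped pattern is 0x0180, i.e. 384, which the unsigned variant returns.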
