How are the ntoh functions implemented under RHEL/GCC?

Question

A production issue has led our team to the following questions:

Under RHEL6 using GCC 4.4.6, how are ntohs and ntohl implemented?
Are the implementations known to be fast or slow?
How can I actually see the generated assembly code for the functions?

I know the implications behind these questions may seem far-fetched and ridiculous, but I have been asked to investigate.

The hardware in question is a little-endian Intel box with a 64-bit processor, and the code is compiled as 64-bit.

GCC 4.4 is a very old version of GCC. The current one is 4.8.1. You should consider upgrading your compiler (notice that C++ support has improved a lot since GCC 4.4) — Basile Starynkevitch, Jul 30 '13 at 18:14
@Basile, 4.4 is the system compiler on RHEL6 and is supported and maintained, so it's not unreasonable to stick with it — Jonathan Wakely, Jul 30 '13 at 19:18
@BasileStarynkevitch: Indeed, I use GCC 4.7.2 and 4.8.1 for most development and R&D type work, but we use 4.4 for production code because that is what is distributed with the distro. — John Dibling, Jul 30 '13 at 20:15
@Charles: I think the RHEL6 tag is relevant in this case. Why did you remove it? — John Dibling, Jul 31 '13 at 12:26
@JohnDibling, given the answers, I'd actually sooner suggest tagging with glibc than creating a tag for RHEL 6, as the functions in question are provided by glibc, not the operating system itself. — Charles, Jul 31 '13 at 18:36

rodrigo · Answer 1 · 2013-07-30T17:30:33.897

Do the following:

test.c

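#include <arpa/inet.h>   /* ntohl */
#include <stdint.h>      /* uint32_t */

int main(void)
{
   /* volatile stops GCC from folding the swap at compile time */
   volatile uint32_t x = 0x12345678;
   x = ntohl(x);
   return 0;
}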

Then compile with:

$ gcc -O3 -g -save-temps test.c

And analyze the resulting test.s file, or alternatively run objdump -S test.o.

On my machine (Ubuntu 13.04) the relevant assembly is:

movl    $305419896, 12(%esp)
movl    12(%esp), %eax
bswap   %eax
movl    %eax, 12(%esp)

Hints:

305419896 is 0x12345678 in decimal.
12(%esp) is the address of the volatile variable.
All the movl instructions are there for the volatile-ness of x. The only really interesting instruction is bswap.
Obviously, ntohl is compiled as an inline-intrinsic.

Moreover, if I look at test.i (the preprocessed output), I find that ntohl is #defined as simply __bswap_32(), which is an inline function containing just a call to __builtin_bswap32().
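A variant of the test above, as a minimal sketch: taking the value as a function argument also keeps GCC from folding the swap at compile time, but without the extra loads and stores that volatile forces. The names to_host32 and check_round_trip are hypothetical, made up for illustration. Compiled with -O2 -S on x86, the body of to_host32 should typically come down to little more than the bswap, though the exact output depends on the compiler version and target.

```c
#include <arpa/inet.h>  /* ntohl */
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the argument is not known at compile time,
   so the swap cannot be folded away and no volatile is needed. */
uint32_t to_host32(uint32_t net)
{
    return ntohl(net);
}

/* Converting twice must give back the original value,
   whatever the host byte order is. */
void check_round_trip(uint32_t v)
{
    assert(to_host32(to_host32(v)) == v);
}
```

Inspect the generated test.s (or use objdump -d on the object file) and look for the to_host32 label.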

+1. Also note that without the `volatile`, GCC will propagate the constant and swap it at compile time (the glibc implementation uses `__builtin_constant_p` to achieve this). So the short answer to this question is "It's zero or one instructions so you can ignore it". This has been true since long before RHEL 6. — Nemo, Jul 30 '13 at 19:18
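The constant-folding trick from the comment above can be sketched roughly as follows. This is not the glibc header itself, only an illustration of the same dispatch pattern; SWAP32_CONST, MY_BSWAP32 and demo_constant_dispatch are hypothetical names. The assumption is a GCC-compatible compiler that provides __builtin_constant_p and __builtin_bswap32.

```c
#include <assert.h>
#include <stdint.h>

/* Pure shift-and-mask swap: with a constant argument the compiler
   can evaluate the whole expression at compile time. */
#define SWAP32_CONST(x) \
    ((((x) & 0xff000000u) >> 24) | (((x) & 0x00ff0000u) >> 8) | \
     (((x) & 0x0000ff00u) << 8)  | (((x) & 0x000000ffu) << 24))

/* Hypothetical dispatch macro: constants take the foldable path,
   everything else goes through the builtin (a bswap on x86). */
#define MY_BSWAP32(x) \
    (__builtin_constant_p(x) ? SWAP32_CONST((uint32_t)(x)) \
                             : __builtin_bswap32(x))

void demo_constant_dispatch(void)
{
    volatile uint32_t runtime = 0x12345678u;  /* not a compile-time constant */

    assert(MY_BSWAP32(0x12345678u) == 0x78563412u);  /* constant path */
    assert(MY_BSWAP32(runtime) == 0x78563412u);      /* runtime path */
}
```

Both paths give the same result; the only difference is whether an instruction is emitted at all.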

Jonathan Wakely · Accepted Answer · 2013-07-30T17:32:47.037

11

They're provided by glibc, not GCC. Look in /usr/include/bits/byteswap.h for the __bswap_16 and __bswap_32 functions, which are used when optimization is enabled (see <netinet/in.h> for details of how).
You didn't say what architecture you're using. On a big-endian system they're no-ops, so optimally fast! On little-endian they're architecture-specific hand-optimized assembly code.
Use GCC's -save-temps option to keep the intermediate .s files, or use -S to stop after compilation and before assembling the code, or use http://gcc.godbolt.org/
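The big-endian versus little-endian split described above can be sketched like this. It is a rough illustration of the pattern, not a copy of <netinet/in.h>; my_ntohl, my_ntohs and check_dispatch are hypothetical names, and the sketch assumes glibc's <endian.h> plus a compiler that provides __builtin_bswap32.

```c
#include <arpa/inet.h>  /* ntohl, ntohs for comparison */
#include <assert.h>
#include <endian.h>     /* glibc: __BYTE_ORDER, __BIG_ENDIAN */
#include <stdint.h>

#if __BYTE_ORDER == __BIG_ENDIAN
/* Host order already equals network order: nothing to do. */
#  define my_ntohl(x) (x)
#  define my_ntohs(x) (x)
#else
/* Little-endian host: reverse the bytes. */
#  define my_ntohl(x) __builtin_bswap32(x)
#  define my_ntohs(x) \
      ((uint16_t)((((x) >> 8) & 0xff) | (((x) & 0xff) << 8)))
#endif

/* The sketch should agree with the library versions on any host. */
void check_dispatch(uint32_t v32, uint16_t v16)
{
    assert(my_ntohl(v32) == ntohl(v32));
    assert(my_ntohs(v16) == ntohs(v16));
}
```

On a big-endian build the macros expand to their argument, so no code is generated for them at all.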

edited Jul 30 '13 at 17:32

answered Jul 30 '13 at 17:26

Sorry, quite right. This is an x86_64 machine, so little endian. – John Dibling Jul 30 '13 at 17:30

score 7 · Answer 3 · answered Jul 30 '13 at 17:24

These are implemented in glibc. Look at /usr/include/netinet/in.h. They will most likely rely on the glibc byteswap macros (/usr/include/bits/byteswap.h on my machine).

These are implemented in assembly in my header so should be pretty fast. For constants, this is done at compile time.

score 1 · Answer 4 · answered Aug 11 '13 at 13:32

GCC/glibc causes ntohl() and htonl() to be inlined into the calling code, so the function call overhead is avoided. Furthermore, each ntohl() or htonl() call is translated into a single bswap assembler instruction. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", bswap has both latency and throughput of "1" on all current Intel CPUs. So only a single CPU clock cycle is required to execute ntohl() or htonl().

ntohs() and htons() are implemented as a rotation by 8 bits. This effectively swaps the two halves of the 16-bit operand. Latency and throughput are similar to those of bswap.
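To see why a rotation by 8 bits is the same thing as a 16-bit byte swap, here is a small sketch. The helper names rotl16_by8, swap_bytes16 and check_rotate_is_swap are hypothetical; the code only checks the arithmetic and says nothing about which instruction a given compiler picks.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 16-bit value left by 8 bits. */
static inline uint16_t rotl16_by8(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Explicit byte swap, written out for comparison. */
static inline uint16_t swap_bytes16(uint16_t x)
{
    uint8_t lo = (uint8_t)(x & 0xff);
    uint8_t hi = (uint8_t)(x >> 8);
    return (uint16_t)((lo << 8) | hi);
}

/* Exhaustively confirm both give the same result for all 65536 inputs. */
void check_rotate_is_swap(void)
{
    uint32_t v;
    for (v = 0; v <= 0xffff; ++v)
        assert(rotl16_by8((uint16_t)v) == swap_bytes16((uint16_t)v));
}
```

For example, rotl16_by8(0x1234) gives 0x3412: the high byte 0x12 moves to the low half and the low byte 0x34 moves to the high half.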
