6

A production issue has led our team to the following questions:

  1. Under RHEL6 using GCC 4.4.6, how are ntohs and ntohl implemented?
  2. Are the implementations known to be fast or slow?
  3. How can I actually see the generated assembly code for the functions?

I know the implications behind these questions may seem far-fetched and ridiculous, but I have been asked to investigate.

The hardware in question is a little-endian, 64-bit Intel box, and the code is compiled as 64-bit.

John Dibling
  • GCC 4.4 is a very old version of GCC. The current one is 4.8.1. You should consider upgrading your compiler (notice that C++ support has improved a lot since GCC 4.4) – Basile Starynkevitch Jul 30 '13 at 18:14
  • 2
    @Basile, 4.4 is the system compiler on RHEL6 and is supported and maintained, so it's not unreasonable to stick with it – Jonathan Wakely Jul 30 '13 at 19:18
  • @BasileStarynkevitch: Indeed, I use GCC 4.7.2 and 4.8.1 for most development and R&D type work, but we use 4.4 for production code because that is what is distributed with the distro. – John Dibling Jul 30 '13 at 20:15
  • @Charles: I think the RHEL6 tag is relevant in this case. Why did you remove it? – John Dibling Jul 31 '13 at 12:26
  • 1
    @JohnDibling, given the answers, I'd actually sooner suggest tagging with glibc than creating a tag for RHEL 6, as the functions in question are provided by glibc, not the operating system itself. – Charles Jul 31 '13 at 18:36
  • @Charles: Fair enough. I'll add that tag. – John Dibling Jul 31 '13 at 19:05

4 Answers

12

Do the following:

test.c

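#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>     /* uint32_t */

int main(void)
{
   /* volatile keeps the compiler from folding the swap away at compile time */
   volatile uint32_t x = 0x12345678;
   x = ntohl(x);
   return 0;
}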

Then compile with:

$ gcc -O3 -g -save-temps test.c

And analyze the resulting test.s file, or alternatively run objdump -S test.o.

On my machine (Ubuntu 13.04), the relevant assembler is:

movl    $305419896, 12(%esp)
movl    12(%esp), %eax
bswap   %eax
movl    %eax, 12(%esp)

Hints:

  • 305419896 is 0x12345678 in decimal.
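  • 12(%esp) is the address of the volatile variable. (The %esp register shows that this listing comes from a 32-bit build; a 64-bit build would typically address the variable relative to %rsp instead.)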
  • All the movl instructions are there for the volatile-ness of x. The only really interesting instruction is bswap.
  • Evidently, ntohl is compiled as an inline intrinsic rather than a function call.

Moreover, if I look at test.i (the preprocessed output), I find that ntohl is #defined as simply __bswap_32(), which is an inline function containing just a call to __builtin_bswap32().
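To check this on your own toolchain, a small translation unit like the following can be compiled with gcc -O2 -S and the three function bodies compared. It is only a sketch: the function names are made up for illustration, and whether each body collapses to a single bswap depends on the compiler version and the glibc headers installed.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>     /* uint32_t */

/* Byte swap written out by hand, for comparison (hypothetical name). */
uint32_t swap32_manual(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) <<  8) |
           ((v & 0x00ff0000u) >>  8) |
           ((v & 0xff000000u) >> 24);
}

/* Direct use of the GCC builtin (hypothetical name). */
uint32_t swap32_builtin(uint32_t v)
{
    return __builtin_bswap32(v);
}

/* The library call under discussion (hypothetical name).
   On a big-endian host this returns v unchanged. */
uint32_t swap32_ntohl(uint32_t v)
{
    return ntohl(v);
}
```

On a little-endian x86 build, swap32_builtin and swap32_ntohl would be expected to produce the same instruction sequence; swap32_manual may or may not be recognized as a byte swap, depending on the optimizer.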

rodrigo
  • 1
    +1. Also note that without the `volatile`, GCC will propagate the constant and swap it at compile time (the glibc implementation uses `__builtin_constant_p` to achieve this). So the short answer to this question is "It's zero or one instructions, so you can ignore it". This has been true since long before RHEL 6. – Nemo Jul 30 '13 at 19:18
11
  1. They're provided by glibc, not GCC. Look in /usr/include/bits/byteswap.h for the __bswap_16 and __bswap_32 functions, which are used when optimization is enabled (see <netinet/in.h> for details of how).
  2. You didn't say what architecture you're using. On a big-endian system they're no-ops, so optimally fast! On little-endian systems they're architecture-specific, hand-optimized assembly code.
  3. Use GCC's -save-temps option to keep the intermediate .s files, or use -S to stop after compilation and before assembling the code, or use http://gcc.godbolt.org/
Jonathan Wakely
7

These are implemented in glibc. Look at /usr/include/netinet/in.h. They will most likely rely on the glibc byteswap macros (/usr/include/bits/byteswap.h on my machine).

These are implemented in assembly in my header, so they should be pretty fast. For constant arguments, the swap is done at compile time.
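The compile-time handling of constants can be sketched roughly as follows. This is not the glibc source: the MY_ names are hypothetical, and the runtime path here uses __builtin_bswap32 where the header described above may use inline assembly. The point is only to show how __builtin_constant_p lets one macro pick a foldable expression for constants and an instruction for everything else.

```c
#include <stdint.h>  /* uint32_t */

/* Pure shift-and-mask expression; the compiler can fold it completely
   when the argument is a constant. (Hypothetical name.) */
#define MY_BSWAP32_CONST(x) \
    ((((x) & 0xff000000u) >> 24) | (((x) & 0x00ff0000u) >>  8) | \
     (((x) & 0x0000ff00u) <<  8) | (((x) & 0x000000ffu) << 24))

/* Runtime path; assumed here to be the GCC builtin. (Hypothetical name.) */
static inline uint32_t my_bswap32_runtime(uint32_t v)
{
    return __builtin_bswap32(v);
}

/* Pick the constant expression when the argument is known at compile
   time, otherwise fall back to the runtime path. (Hypothetical name.) */
#define MY_BSWAP32(x) \
    (__builtin_constant_p(x) ? MY_BSWAP32_CONST((uint32_t)(x)) \
                             : my_bswap32_runtime(x))

/* With optimization on, this is expected to reduce to loading a constant. */
uint32_t swap_constant(void)
{
    return MY_BSWAP32(0x12345678u);
}

/* Here the argument is not a constant, so the runtime path is taken. */
uint32_t swap_variable(uint32_t v)
{
    return MY_BSWAP32(v);
}
```

Compiling this with gcc -O2 -S and comparing swap_constant with swap_variable is one way to see the "zero or one instructions" behaviour for yourself.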

Guillaume
1

GCC/glibc causes ntohl() and htonl() to be inlined into the calling code, so the function call overhead is avoided. Furthermore, each ntohl() or htonl() call is translated into a single bswap assembler instruction. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", bswap has both a latency and a throughput of "1" on all current Intel CPUs. So only a single CPU clock cycle is required to execute ntohl() or htonl().

ntohs() and htons() are implemented as a rotation by 8 bits. This effectively swaps the two bytes of the 16-bit operand. Latency and throughput are similar to those of bswap.
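The equivalence between a 16-bit byte swap and a rotation by 8 can be illustrated with a short sketch. The function names are hypothetical, and whether the compiler emits an actual rotate instruction for rot8 depends on the GCC version and optimization level, so it is worth confirming with -S.

```c
#include <arpa/inet.h>  /* ntohs */
#include <stdint.h>     /* uint16_t */

/* Rotate a 16-bit value by 8 bits, which exchanges its two bytes.
   (Hypothetical name.) The cast truncates the promoted int result
   back to 16 bits. */
uint16_t rot8(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* The library call, for comparing the generated code. (Hypothetical name.)
   On a big-endian host this returns v unchanged. */
uint16_t via_ntohs(uint16_t v)
{
    return ntohs(v);
}
```

For example, rot8(0x1234) yields 0x3412, which matches what ntohs(0x1234) returns on a little-endian machine.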

Kai Petzke