Why did the compiler put a 15-byte nop in the middle of this function?

Question

I'm working on a decompiler, and one of my friends sent me a little program that he wrote. One function produced odd output, so I inspected the disassembly. It's pretty boring, except that there's a 15-byte nop in the middle of it:

0000000100001300 <_strhash>:
   100001300:   55                      push   rbp
   100001301:   48 89 e5                mov    rbp,rsp
   100001304:   8a 0f                   mov    cl,BYTE PTR [rdi]
   100001306:   31 c0                   xor    eax,eax
   100001308:   84 c9                   test   cl,cl
   10000130a:   74 2b                   je     100001337 <_strhash+0x37>
   10000130c:   48 ff c7                inc    rdi
   10000130f:   31 d2                   xor    edx,edx
   100001311:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]
   100001318:   0f 1f 84 00 00 00 00 
   10000131f:   00 
   100001320:   0f b6 c9                movzx  ecx,cl
   100001323:   89 d0                   mov    eax,edx
   100001325:   c1 e0 05                shl    eax,0x5
   100001328:   29 d0                   sub    eax,edx
   10000132a:   01 c8                   add    eax,ecx
   10000132c:   8a 0f                   mov    cl,BYTE PTR [rdi]
   10000132e:   48 ff c7                inc    rdi
   100001331:   84 c9                   test   cl,cl
   100001333:   89 c2                   mov    edx,eax
   100001335:   75 e9                   jne    100001320 <_strhash+0x20>
   100001337:   5d                      pop    rbp
   100001338:   c3                      ret

I didn't even know that x86 instructions could go to 15 bytes.

What is the benefit of this big nop?

That looks suspiciously like a bug either in your friend's code or in your decompiler. — Ken Y-N, Dec 12 '16 at 05:11
The NOP is aligning the top of the loop to a 16-byte boundary. [Agner Fog](http://www.agner.org/optimize/optimizing_assembly.pdf) says: _Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry._ — Michael Petch, Dec 12 '16 at 05:14
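The arithmetic behind that comment can be checked against the addresses in the listing. The following is only a sketch: the helper name `padding_to_boundary` is made up for illustration, and the addresses and bytes are copied from the disassembly in the question.

```python
ALIGNMENT = 16  # boundary named in the comment above


def padding_to_boundary(address: int, alignment: int = ALIGNMENT) -> int:
    """Bytes needed to advance `address` to the next multiple of `alignment`."""
    return -address % alignment


nop_start = 0x100001311  # address of the long nop in the listing
loop_top = 0x100001320   # target of the jne at 100001335

pad = padding_to_boundary(nop_start)
print(pad)                    # 15
print(hex(nop_start + pad))   # 0x100001320, the loop top

# Encoding of the nop as shown in the listing:
# seven prefix bytes (six 66, one 2e), then 0f 1f, ModRM 84, SIB 00, disp32.
nop_bytes = bytes.fromhex("66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00")
print(len(nop_bytes))         # 15
```

The code before the loop ends one byte past a 16-byte boundary, so a full 15 bytes of padding are needed, and 15 bytes is also the longest instruction x86 allows. A single maximum-length nop therefore fills the gap with one instruction.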
@MichaelPetch, that's pretty much what I thought, and you even have a reference for it. If you put the same thing in an answer, I'll accept it. — zneak, Dec 12 '16 at 06:09
gcc's default for `-falign-loops` isn't usually that aggressive, so that might be a different compiler. Modern gcc tends to default to using `.p2align 4,,10` / `.p2align 3`, so it aligns to 16B if that takes at most 10B of padding, but otherwise only pads to an 8B (2^3) boundary. For example, see this gcc 6.2 output on Godbolt: https://godbolt.org/g/kC3LwL — Peter Cordes, Dec 13 '16 at 02:01
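For comparison, here is a rough model of that `.p2align 4,,10` / `.p2align 3` pair applied to the address where the nop starts in the listing. It assumes the GNU assembler meaning of the third operand (the maximum number of bytes to skip, inclusive; if more would be needed, the directive does nothing). The function names are hypothetical, and this does not describe what the compiler in the question actually emitted.

```python
def p2align(address, power, max_skip=None):
    """Model of `.p2align power,,max_skip`: return the address after padding."""
    alignment = 1 << power
    pad = -address % alignment
    # If more than max_skip bytes would be needed, skip the alignment entirely.
    if max_skip is not None and pad > max_skip:
        return address
    return address + pad


def gcc_style_loop_align(address):
    """Apply `.p2align 4,,10` followed by `.p2align 3`, as in the comment."""
    address = p2align(address, 4, max_skip=10)
    return p2align(address, 3)


addr = 0x100001311  # where the padding starts in the listing
result = gcc_style_loop_align(addr)
print(hex(result), result - addr)  # 0x100001318 7
```

Under this model the loop would start at an 8-byte boundary after 7 bytes of padding, instead of the 15 bytes seen in the listing, which is consistent with the suggestion that a different compiler or alignment setting produced this code.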
