Why does 64-bit VC++ compiler add nop instruction after function calls?

Question

I've compiled the following using Visual Studio C++ 2008 SP1, x64 C++ compiler:

I'm curious, why did the compiler add those nop instructions after those calls?

PS1. I would understand if the 2nd and 3rd nops were there to align the code on a 4-byte boundary, but the 1st nop breaks that assumption.

PS2. The C++ code that was compiled had no loops or special optimization stuff in it:

CTestDlg::CTestDlg(CWnd* pParent /*=NULL*/)
    : CDialog(CTestDlg::IDD, pParent)
{
    m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);

    //This makes no sense. I used it to set a debugger breakpoint
    ::GdiFlush();
    srand(::GetTickCount());
}

PS3. Additional Info: First off, thank you everyone for your input.

Here are some additional observations:

My first guess was that incremental linking could've had something to do with it. But the Release build settings in Visual Studio for the project have incremental linking off.
This seems to affect x64 builds only. The same code built as x86 (or Win32) does not have those nops, even though the instructions used are very similar:

I tried to build it with a newer linker, and even though the x64 code produced by VS 2013 looks somewhat different, it still adds those nops after some calls:

Also, dynamic vs static linking to MFC made no difference to the presence of those nops. This one is built with dynamic linking to the MFC DLLs with VS 2013:

Also note that those nops can appear after near and far calls as well, and they have nothing to do with alignment. Here's a part of the code that I got from IDA if I step a little bit further on:

As you see, the nop is inserted after a far call that happens to "align" the next lea instruction on the B address! That makes no sense if those were added for alignment only.

I was originally inclined to believe that since near relative calls (i.e. those that start with E8) are somewhat faster than far calls (or the ones that start with FF,15 in this case)

the linker may try to go with near calls first, and since those are one byte shorter than far calls, if it succeeds, it may pad the remaining space with nops at the end. But then the example (5) above kinda defeats this hypothesis.
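
The byte arithmetic behind this hypothesis can be sketched in C++ as follows. This only illustrates the idea (a 6-byte `FF 15 disp32` call rewritten as a 5-byte `E8 rel32` call plus one `90` nop); `relax_indirect_call` is a hypothetical helper, and nothing here is confirmed to be what the Microsoft linker actually does.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of any real linker: rewrites a 6-byte
// indirect call (FF 15 disp32, i.e. call qword ptr [rip+disp32]) in place
// as a 5-byte direct call (E8 rel32) followed by a 1-byte NOP (90),
// so the instruction stream keeps the same total length.
inline void relax_indirect_call(std::array<std::uint8_t, 6>& insn,
                                std::int32_t rel32) {
    // Precondition: the buffer really holds an FF 15 indirect call.
    assert(insn[0] == 0xFF && insn[1] == 0x15);

    insn[0] = 0xE8;  // opcode of "call rel32"
    // rel32 is relative to the end of the 5-byte call instruction.
    // memcpy stores it in host byte order, which matches the x86
    // encoding on a little-endian host.
    std::memcpy(&insn[1], &rel32, sizeof rel32);
    insn[5] = 0x90;  // NOP fills the byte that is left over
}
```

If the hypothesis held, every such nop would sit directly after an `E8` call. A nop that follows an `FF 15` call cannot be explained this way.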

So I still don't have a clear answer to this.

Looks suspiciously like RIP-relative indirect calls that were relaxed to direct calls by the linker – indirect calls are a byte longer on x86, so the linker inserted the nops to make them the same length. — , Jun 30 '17 at 20:45
They might be [inserted to allow for breakpoints](https://social.msdn.microsoft.com/Forums/ie/en-US/fc639fd6-ee29-4ae8-9f93-e03caed9cad4/jump-to-local-for-which-a-problem-is-created?forum=vstscode), but I'm not sure what they're doing in a release build. — wally, Jun 30 '17 at 20:45
Do you still get NOPs for a simple program with a single function call? — wally, Jun 30 '17 at 20:49
I suspect @Fanael is correct. Getting rid of that NOP would mean shifting all the code. But shifting all the code would change a lot of addresses. Seems like a chicken-egg problem that's being solved with NOPs. — Mysticial, Jun 30 '17 at 20:50
Of course, why use nops for this purpose is beyond me – address override prefixes incur no overhead for direct calls on modern CPUs and unlike nops, they don't count as a separate instruction in the decoders. — , Jun 30 '17 at 20:52
@Mysticial Weird how this isn't necessary with how ELF does (dynamic) linking. — fuz, Jun 30 '17 at 22:19
@fuz With ELF function calls to functions in shared libraries are always made through a stub function. On Windows function calls to functions in DLLs are often made directly using an indirect call instruction. The `call cs:LoadIconW` instruction in the disassembly above is an example of this. The location the disassembler has called `LoadIconW` contains a pointer to the actual `LoadIconW` function. — Ross Ridge, Jul 01 '17 at 02:34
@Mysticial: The other option would be to pad the `call` instruction with a prefix, but I'm not sure that's guaranteed to be future-proof. Some future ISA extension might use `rep call` to mean something special. I tested, and `call` works on Skylake when preceded by `rep`, or `0x40` (REX.W=0), or `0x48` (REX.W=1). I'd guess that a REX prefix is more future-proof. A linker would need to check that there wasn't already a REX prefix, though (e.g. from hand-written code with padding), and that's impossible because you can't unambiguously step backwards in x86. Multiple REP prefixes would be ok. — Peter Cordes, Jul 03 '17 at 03:24
The linker does have to know it's a `call` or `jmp` instruction, right? The opcode has to change from indirect to rel32. (Hmm, prefixes on `jcc` instructions have special meaning as branch-prediction hints on P4. But `jcc` can't be indirect anyway, so could only appear for conditional tailcalls that were already using a direct jump.) — Peter Cordes, Jul 03 '17 at 03:30
Oh actually, I think `REX` has to be the last prefix if it appears, so checking the byte before the `call` opcode can give false positives (previous instruction ended with `0x4?`), but not false negatives. — Peter Cordes, Jul 03 '17 at 03:32
@PeterCordes "The other option would be to pad the call instruction with a prefix" like an address override prefix, which is free for `call` on pretty much everything and highly unlikely to ever change the meaning of `call`? That's what GNU ld is using. — , Jul 04 '17 at 02:28
@TriskalJM No, I'm not sure what's actually going on here, I'm just speculating, so I don't want to answer. — , Jul 04 '17 at 02:46
Can you show the disassembly of the produced object file and binary side by side? As in before and after linking. — Goswin von Brederlow, Jul 04 '17 at 10:04
@GoswinvonBrederlow: Sorry. I'm not really good with object files. I updated my original post with additional details though. — c00000fd, Jul 05 '17 at 01:13
@c00000fd they're all near calls, far calls are never used on most modern operating systems, because segmentation is dead (to be fair, call gates have some advantages over `syscall` and `sysenter`, but no OS I'm aware of uses them). The difference is direct vs indirect. — , Jul 05 '17 at 09:01
If the `nop`s are still there even with dynamic linking, the linker relaxation idea's gotta be wrong then. — , Jul 05 '17 at 09:02
I think Fanael is correct. The original call is an indirect call and its op-code is one byte longer than direct call. When the packer changed the indirect call to direct call, one nop was padded. Here is a similar question, https://reverseengineering.stackexchange.com/questions/8030/purpose-of-nop-immediately-after-call-instruction — Houcheng, Aug 16 '17 at 09:48
@Houcheng: There's no "packer" involved in producing that code. It was just compiled by VS2008 compiler. — c00000fd, Aug 16 '17 at 23:39
Just want to point out that I ended up on this question because of a PDF about static variable initialization in which the author prescribes calling functions (at least in some cases) with a pair of `call/nop`: « add the necessary call/nop to those functions within the “.init” section » (https://cseweb.ucsd.edu/~gbournou/CSE131/GlobalAndStaticVars.pdf) But it doesn't explain why, which makes the answers here not very satisfying to me. — foxesque, Mar 24 '20 at 10:53

score 5 · Answer 1 · answered Sep 14 '17 at 21:04

This is purely a guess, but it might be some kind of a SEH optimization. I say optimization because SEH seems to work fine without the NOPs too. NOP might help speed up unwinding.

In the following example (live demo with VC2017), there is a NOP inserted after a call to basic_string::assign in test1 but not in test2 (identical but declared as non-throwing¹).

#include <stdio.h>
#include <string>

int test1() {
  std::string s = "a";  // NOP inserted here
  s += getchar();
  return (int)s.length();
}

int test2() throw() {
  std::string s = "a";
  s += getchar();
  return (int)s.length();
}

int main()
{
  return test1() + test2();
}

Assembly:

test1:
    . . .
    call     std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
    npad     1         ; nop
    call     getchar
    . . .
test2:
    . . .
    call     std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
    call     getchar

Note that MSVS compiles by default with the /EHsc flag (synchronous exception handling). Without that flag the NOPs disappear, and with /EHa (synchronous and asynchronous exception handling), throw() no longer makes a difference because SEH is always on.

¹ For some reason only throw() seems to reduce the code size, using noexcept makes the generated code even bigger and summons even more NOPs. MSVC...

score 3 · Answer 2 · answered Feb 08 '19 at 07:59

This is special filler that lets the exception handler/unwinding function correctly detect whether execution is in the prologue, the epilogue, or the body of the function.
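
One way to picture this, as an assumption rather than documented behaviour of this compiler: an unwinder that classifies a frame by its return address looks that address up in a half-open code range. If a `call` is the last instruction of a range, its return address equals the end of the range and falls outside it, while one byte of filler after the call keeps it inside. `CodeRange`, `contains` and `return_address` below are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical half-open code range [begin, end), standing in for a
// function body or try region as an unwinder might record it.
struct CodeRange {
    std::uint64_t begin;
    std::uint64_t end;
};

// True when the address lies inside the half-open range.
inline bool contains(CodeRange r, std::uint64_t address) {
    assert(r.begin <= r.end);  // a range must not be inverted
    return address >= r.begin && address < r.end;
}

// The return address pushed by a call is the address of the instruction
// that follows it: the call's own address plus its encoded length.
inline std::uint64_t return_address(std::uint64_t call_address,
                                    std::uint64_t call_length) {
    return call_address + call_length;
}
```

For a 5-byte call at `0x1000`, the return address is `0x1005`. A range ending at `0x1005` does not contain it, but the same range extended by a 1-byte nop (ending at `0x1006`) does.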

Anatoly Mikhailov


score -3 · Answer 3 · answered Sep 13 '17 at 17:08

This is due to the x64 calling convention, which requires the stack to be 16-byte aligned before any call instruction. This is not (to my knowledge) a hardware requirement but a software one. It guarantees that on entry to a function (that is, right after a call instruction), the value of the stack pointer is always 8 modulo 16, which permits simple data alignment and stores/reads at aligned locations on the stack.

MEHM-


Although your statement is correct, it has nothing to do with my original question. – c00000fd Sep 13 '17 at 17:30
You asked why the nop's are added to the assembly, and that is the reason why. How does that not answer your question ? – MEHM- Sep 13 '17 at 17:33
I asked why `nop`s are added after a `call` instruction and not the function itself. What you mentioned applies to the body (or code) of the functions, and not the `call` instruction. – c00000fd Sep 13 '17 at 17:49
NOP can pad code to align `RIP` (but it's not in this case; look at the code addresses in the dump). The calling convention requires `RSP` to be aligned. NOP doesn't modify `RSP`. – Peter Cordes Sep 14 '17 at 05:33
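
The distinction made in these comments can be written out as plain arithmetic. This is a minimal sketch that models only the stack pointer; `rsp_after_call` and `rsp_after_nop` are hypothetical helper names, and the starting value is an arbitrary 16-byte aligned number.

```cpp
#include <cassert>
#include <cstdint>

// Models only the stack-pointer effect of an x64 call: the CPU pushes an
// 8-byte return address, so RSP drops by 8.
inline std::uint64_t rsp_after_call(std::uint64_t rsp_before_call) {
    // The calling convention expects RSP to be 16-byte aligned here.
    assert(rsp_before_call % 16 == 0);
    return rsp_before_call - 8;
}

// A NOP only advances RIP, so RSP comes out unchanged.
inline std::uint64_t rsp_after_nop(std::uint64_t rsp) {
    return rsp;
}
```

Starting from an aligned `RSP`, the callee sees `RSP % 16 == 8` whether or not a nop follows the call in the caller, so the alignment rule cannot be what the nop is for.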
