I found the following question: Is fastcall really faster?

No clear answer for x86 was given, so I decided to write a benchmark.

Here is the code:

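#include <stdio.h>
#include <tchar.h>
#include <time.h>

// __fastcall: the argument is passed in ECX.
int __fastcall func(int i)
{
    return i + 5;
}

// __stdcall: the argument is passed on the stack.
int __stdcall func2(int i)
{
    return i + 5;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int iter = 100;
    int x = 0;

    // Time the __fastcall version.
    clock_t t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));

    // Time the __stdcall version.
    t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func2(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));

    // Print x so that the result of the calls is actually used.
    printf("%d", x);
    return 0;
}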

Without optimization, the result in MSVC 10 is:

4671
4414

With maximum optimization fastcall is sometimes faster, but I guess that is multitasking noise. Here is the average result (with iter = 5000):

6638
6487

stdcall looks faster!

Here are the results for GCC: http://ideone.com/hHcfP Again, fastcall lost the race.

Here is part of the disassembly for fastcall:

011917EF  pop         ecx  
011917F0  mov         dword ptr [ebp-8],ecx  
    return i + 5;
011917F3  mov         eax,dword ptr [i]  
011917F6  add         eax,5

And this is for stdcall:

    return i + 5;
0119184E  mov         eax,dword ptr [i]  
01191851  add         eax,5  

i is passed via ECX instead of the stack, but it is saved to the stack in the function body! So the whole effect is negated. This simple function could be computed using only registers, yet there is no real difference between the two versions.

Can anyone explain what the reason for fastcall is? Why doesn't it give a speedup?

Edit: With optimization it turned out that both functions are inlined. When I turned inlining off, they were both compiled to:

00B71000  add         eax,5  
00B71003  ret  

This looks like a great optimization, indeed, but it doesn't respect calling conventions at all, so the test is not fair.
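
One way to keep the optimizer from inlining the two functions while still compiling with full optimization is a compiler-specific noinline annotation. The sketch below is not from the original post: NOINLINE, FASTCALL, STDCALL, add5_fast, add5_std, run_fast and run_std are names made up for illustration, and the convention macros are assumed to expand to nothing on any target other than 32-bit x86.

```cpp
#include <cassert>

// Hypothetical helper macros (not part of the original benchmark).
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

// The two conventions only differ on 32-bit x86; elsewhere they are left out.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define FASTCALL __fastcall
#  define STDCALL  __stdcall
#elif defined(__GNUC__) && defined(__i386__)
#  define FASTCALL __attribute__((fastcall))
#  define STDCALL  __attribute__((stdcall))
#else
#  define FASTCALL
#  define STDCALL
#endif

// Same bodies as func and func2 in the question, but never inlined.
NOINLINE int FASTCALL add5_fast(int i) { return i + 5; }
NOINLINE int STDCALL  add5_std(int i)  { return i + 5; }

// Repeats the question's loop body n times using the fastcall function.
int run_fast(int n)
{
    assert(n >= 0);
    int x = 0;
    for (int k = 0; k < n; ++k)
        x = add5_fast(x & 0xFF);
    return x;
}

// Same loop using the stdcall function.
int run_std(int n)
{
    assert(n >= 0);
    int x = 0;
    for (int k = 0; k < n; ++k)
        x = add5_std(x & 0xFF);
    return x;
}
```

Timing run_fast(n) against run_std(n) should then compare real calls. A compiler doing whole-program optimization may still choose its own way of passing arguments to functions it can see completely, so checking the disassembly remains worthwhile.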

Andrey
  • Hehe, don't expect inlined code to respect a calling convention. It is fair; not making the call is the point of inlining. – Hans Passant Mar 29 '11 at 23:04
  • Most compilers have a `don't inline flag` – Martin York Mar 29 '11 at 23:07
  • @Hans Passant I turned inlining off and compiler still didn't respect convention – Andrey Mar 30 '11 at 10:28
  • Andrey, you can try calling your functions through: template <class F> __declspec(noinline) F NOIL(F f) { return f; }, e.g. x = NOIL(func)(x & 0xFF); then, if you compile with full optimizations, fastcall is faster! (maybe because it is called fastcall) – Franck Freiburger Feb 23 '12 at 21:48
  • Results are likely to be way off here in 2019. See my post below. – Sirmabus May 05 '19 at 23:09

8 Answers

__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft for optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.

Microsoft responded by adding __fastcall, and they've retained it ever since -- but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.

Jerry Coffin
  • Afaik fastcall and Borland's equivalent "register" originates in some (686/ppro era?) abi document from Intel. – Marco van de Voort Feb 07 '18 at 12:22
  • @MarcovandeVoort: In that case (not to be rude, but...) you clearly just don't know. `fastcall` has been around since the 286 era. Borland has supported `register` since Turbo C 1.0 (at that time, `register` was meaningful for up to two variables, which were allocated in `si` and `di`). – Jerry Coffin Feb 07 '18 at 14:59

Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath), clock() is not even remotely a suitable timer for benchmarking, and __fastcall exists on multiple platforms such as Itanium, not just on x86. In addition, your whole program can effectively be optimized down to nothing but the printf statements, which makes the relative performance of __fastcall and __stdcall very, very irrelevant.

Finally, you've overlooked the main reason a lot of things are done the way they are: legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall, as there will be programs that depend on it. That makes __fastcall a fact of life.
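
As an alternative to clock(), here is a minimal timing sketch built on std::chrono::steady_clock, which is monotonic and usually has much finer resolution. It assumes a C++11 compiler (so something newer than the MSVC 10 used in the question); time_ns and best_of are hypothetical helper names, and taking the minimum of several runs is just one common way to reduce scheduling noise.

```cpp
#include <cassert>
#include <chrono>

// Times a single invocation of `work` and returns the elapsed nanoseconds.
template <class Work>
long long time_ns(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Runs the measurement `runs` times and keeps the smallest value, on the
// assumption that the fastest run is the one least disturbed by other tasks.
template <class Work>
long long best_of(int runs, Work work)
{
    assert(runs > 0);
    long long best = time_ns(work);
    for (int r = 1; r < runs; ++r) {
        const long long t = time_ns(work);
        if (t < best)
            best = t;
    }
    return best;
}
```

A call such as best_of(10, [&] { /* loop over func here */ }) would then replace the pair of clock() calls around each loop.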

Puppy
  • @DeadMG about Itanium: "On Itanium Processor Family (IPF) and AMD64 machines, __fastcall is accepted and **ignored** by the compiler". http://msdn.microsoft.com/en-us/library/6xa169sk(v=VS.100).aspx – Andrey Mar 30 '11 at 10:31
  • @Andrey: A: `__fastcall` may store SSE primitives (128 bit values) in an SSE register, rather than pushing them onto the stack. (Therefore the `__fastcall` switch affects more than the integral registers if SSE intrinsics are being used) B: `clock` is not required to be accurate for benchmarking, particularly in cases where multiple cores are involved. Basically, things like context switches and such will count against you for `clock`, but not for the correct tools like `QueryPerformanceCounter`. (I.e. benchmarking should use platform specific tools) – Billy ONeal Mar 31 '11 at 16:06
  • @Billy ONeal A: I didn't find any reference to it in MSDN, do you have one? B: `clock` is bad for microbenchmarks. When several seconds pass between calls, the effect of context-switching noise is negligible. – Andrey Mar 31 '11 at 16:11
  • @Andrey: I thought I had a reference, but I can't find one right now. Sorry :( Take what I said there with a grain of salt. – Billy ONeal Mar 31 '11 at 16:16
  • @Andrey: I can't predict the value of `x`, but the compiler trivially can. – Puppy Mar 08 '12 at 12:09

Several reasons

  1. At least in most decent x86 implementations, register renaming is in effect -- the effort that looks like it's being saved by using a register instead of memory might not be doing anything on the hardware level.
  2. Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.

Most of the time, where __fastcall would be faster, the function is simple enough to be inlined in any case, which means that it really doesn't matter in real software. (This is one of the main reasons why __fastcall is not often used.)

Side note: What was wrong with Anon's answer?

Billy ONeal
  • "What was wrong with Anon's answer?" Nothing except that it is wrong. I showed that in compiled code nothing is saved (memory operations), and it is not faster. Values are passed via registers but for some reason it is copied to memory. Net profit is zero. – Andrey Mar 29 '11 at 22:00
  • I agree with your statement. `__fastcall` is theoretically reasonable only for simple functions, but with aggressive optimization the compiler can handle those by itself. – Andrey Mar 29 '11 at 22:04
  • The thing that I can't understand is what the reason for `__fastcall`'s existence is if it doesn't add any value. – Andrey Mar 29 '11 at 22:04
  • @Andrey: I suspect it would be because the optimizer isn't as smart about it, because nobody uses it :P – Billy ONeal Mar 29 '11 at 22:14
  • @Andrey: Because some other compilers (e.g. Borland) used it as a possible (and sometimes default) option -- MSVC++ and GCC need to be able to call such code in other DLLs. Also, previous versions of the compiler may well have been faster with that switch. Now we have better optimizers :) – Billy ONeal Mar 29 '11 at 22:15
  • @Billy: *"the effort that looks like's being saved by using a register instead of memory might not be doing anything on the hardware level."* Why is that? From what I just learned from the [wiki article](http://en.wikipedia.org/wiki/Register_renaming), register renaming has nothing to do with physical memory. Is there more to it than that? – BlueRaja - Danny Pflughoeft Mar 29 '11 at 22:16
  • @BlueRaja: It has everything to do with physical memory. Basically, register renaming uses more hardware registers than are exposed in the processor's ISA in order to turn RAM operations into register operations. The overhead of the `__stdcall` operation here is the memory load and store -- the CPU may optimize the push followed directly by a pop into a register-only operation, not touching physical memory at all (the memory write would still have to occur, but the time required to do the write may not be felt due to caching of the write). – Billy ONeal Mar 30 '11 at 01:24
  • @BlueRaja: Hmm.. after reading the Wiki article it seems like the specific term register renaming usually refers to renaming to improve instruction level parallelism -- so I may be using the wrong word here. But the basic idea is that the CPU need not actually wait for the write into main memory to complete before continuing. – Billy ONeal Mar 30 '11 at 01:29
  • @Billy ONeal About dlls. `fastcall` is not standardized so it is impossible to use it for exported functions as is. Microsoft's fastcall is incompatible with Borland's, so using function from Borland dll in MSVC will cause stack corruption. – Andrey Mar 30 '11 at 10:30
  • @Andrey: Microsoft and Borland aren't the only compilers out there. I'm speaking hypothetically here anyway -- you'd have to ask someone on the compiler team if you want to be sure. – Billy ONeal Mar 30 '11 at 14:54
  • @Billy ONeal I wish I could :) – Andrey Mar 30 '11 at 16:47

Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.

So to actually test this, you need to make the functions extern declarations with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.

The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.

Edit

So with separate compilation and gcc -O3 -fomit-frame-pointer -m32 I see quite different code for the two functions:

func:
    leal    5(%ecx), %eax
    ret
func2:
    movl    4(%esp), %eax
    addl    $5, %eax
    ret

Running that with iter=5000 consistently gives me results close to

9990000
14160000

indicating that the fastcall version is a shade over 40% faster.
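
For reference, the arithmetic behind that percentage can be sketched as follows. The helper names speedup and time_saved are made up for illustration, and the two times are the ones quoted above. Note that "40% faster" refers to the throughput ratio; the reduction in running time is smaller.

```cpp
#include <cassert>

// Throughput ratio: how many times more calls per unit of time the faster
// version completes compared with the slower one.
double speedup(double slow_time, double fast_time)
{
    assert(slow_time > 0 && fast_time > 0);
    return slow_time / fast_time;
}

// Fraction of the slower version's running time that the faster one saves.
double time_saved(double slow_time, double fast_time)
{
    assert(slow_time > 0 && fast_time > 0);
    return 1.0 - fast_time / slow_time;
}
```

With the numbers above, speedup(14160000, 9990000) is about 1.42 (roughly 42% more calls per unit of time), while time_saved(14160000, 9990000) is about 0.29 (roughly 29% less time).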

Chris Dodd
  • if I enable full optimization and disable inlining both functions are compiled to same assembly. About `extern`: "Declarations of variables and functions at file scope are external by default." http://msdn.microsoft.com/en-us/library/0603949d(VS.80).aspx so it is meaningless – Andrey Mar 30 '11 at 10:41
  • The `extern` keyword may be the default, but if you put the definitions in a different source file, it has quite a significant effect. If you're getting the same assembly with inlining `disabled` I suspect MSVC is ignoring the `disable inlining` flag – Chris Dodd Nov 25 '11 at 18:15

I compiled the two functions with i686-w64-mingw32-gcc -O2 -fno-inline fastcall.c. This is the assembly generated for func and func2:

@func@4:
    leal    5(%ecx), %eax
    ret
_func2@4:
    movl    4(%esp), %eax
    addl    $5, %eax
    ret $4

__fastcall really looks faster to me. func2 needs to load the input parameter from the stack. func can simply perform a %eax := %ecx + 5 and then returns to the caller.

Furthermore, the output of your program typically looks like this on my system:

2560
3250
154

So __fastcall does not only look faster, it is faster.

Also note that on x86_64 (or x64 as Microsoft calls it), __fastcall is the default and the old non-fastcall convention does not exist anymore. See http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions

By making __fastcall the default, x86_64 catches up with other architectures (such as ARM), where passing arguments in registers is also default.
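
Whether the keyword changes anything therefore depends on the target. Here is a minimal sketch of checking this at compile time, assuming the widely used predefined macros _M_IX86 / __i386__ (32-bit x86) and _M_X64 / __x86_64__ (x86-64). The names target_arch and convention_keywords_matter are hypothetical, and treating every non-x86 target as "keywords have no effect" is a simplifying assumption.

```cpp
#include <cassert>
#include <cstring>

// Describes the architecture this translation unit was compiled for.
const char* target_arch()
{
#if defined(_M_IX86) || defined(__i386__)
    return "x86-32";
#elif defined(_M_X64) || defined(__x86_64__)
    return "x86-64";
#else
    return "other";
#endif
}

// True only on 32-bit x86, where __fastcall and __stdcall select
// different ways of passing arguments.
bool convention_keywords_matter()
{
    const bool x86_32 = std::strcmp(target_arch(), "x86-32") == 0;
    // Sanity check: a 32-bit x86 build implies 4-byte pointers.
    assert(!x86_32 || sizeof(void*) == 4);
    return x86_32;
}
```

A benchmark could call convention_keywords_matter() up front and report that the comparison is meaningless when it returns false.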

Sven
  • Visual Studio 2013 added __vectorcall convention for x64, so your last statement is no longer true. http://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_vectorcall – dss539 Apr 24 '15 at 18:29
  • Thanks, I have mended my answer to reflect that. – Sven Apr 24 '15 at 21:36

Fastcall itself as a register based calling convention isn't great on x86 because there aren't that many named registers available and by using key registers for passing the values, all you're doing is potentially forcing the calling code to push other values onto the stack and forcing the called function if it is of sufficient complexity to do the same. Essentially from an assembly language perspective, you're increasing the pressure on those named registers and explicitly using stack operations to compensate. So even if the CPU has far more registers available for renaming, it isn't going to refactor the explicit stack operations that have to be inserted.

On the other hand, on more "register rich" architectures like x86-64, register based calling conventions (not exactly the same as fastcall of old, but same concept) are the norm and are used across the board. In other words, once we moved from an architecture with few named registers, like x86, to something with more register space, fastcall was back in a big way and became the default, and really the only convention used today.

Jeremiah Gowdy

Note: even though it was edited by the OP in May 2017, this question and its answers are likely to be way out of date and no longer relevant by 2019 (if not a few years earlier).

A) As of at least MSVC 2017 (and the recently released 2019), most of the code is going to be inlined in optimized release builds anyhow. Probably the only function body you will see in the entire example now is "_tmain()".

That is unless you specifically do some tricks like declaring the functions as "volatile" and/or wrapping the test functions in pragmas that turn off some optimizations.
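
One possible reading of the "volatile" trick is to make the calls through a volatile function pointer, since a function itself cannot be declared volatile. The sketch below is an assumption about what was meant, not the answerer's code; add5, add5_ptr and run_indirect are made-up names. Because the pointer is volatile, the compiler has to reload it and make a real indirect call on every iteration.

```cpp
#include <cassert>

// Same body as the functions in the question.
static int add5(int i) { return i + 5; }

// Volatile pointer: its value must be read from memory at each use, so the
// compiler cannot assume it still points at add5 and inline the call.
int (* volatile add5_ptr)(int) = add5;

// Repeats the question's loop body n times through the pointer.
int run_indirect(int n)
{
    assert(n >= 0);
    int x = 0;
    for (int k = 0; k < n; ++k)
        x = add5_ptr(x & 0xFF);
    return x;
}
```

The indirect call adds its own overhead, but it does so for both conventions, so a relative comparison should still be possible.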

B) The latest generation of desktop CPUs (the assumption here) is much improved over the circa-2010 generation. They are much better at caching the stack, memory alignment matters less, etc.

But don't take my word for it. Load your executable into a disassembler (IDA Pro, the MSVC debugger, etc.) and look for yourself (a good way to learn).

Now it would be interesting to see what the performance would be across a large 32-bit application. For example, take the last open-sourced DOOM game release, make builds with stdcall and _fastcall, and look for frame rate differences, along with metrics from any built-in performance reporting features it has.

Sirmabus

It does not appear that __fastcall actually guarantees that the call will be faster. It seems like all you're doing is moving the first few variables into registers before making the call to the function. This most likely makes your function call slower, since it must move the variables into those registers first. Wikipedia has a pretty good write-up about what exactly fastcall is and how it is implemented.

Suroot
  • So the answer is that __fastcall is useless then? Wiki says that the compiler passes values via registers - true and shown in the question, but no profit is taken from it, which makes me think that __fastcall is useless. – Andrey Mar 29 '11 at 22:02
  • Not completely true; if you had 1-2 elements being passed in and you were doing an operation between both of them (assuming nothing else) you might see a speed up within the function since both elements already exist in the register. Note the word MIGHT, will all depend on the compiler and how it generates the assembly. Other than that, I would agree with Billy; it's not used much because there are very few cases where you're only going to operate SOLELY on the variables you passed in. Try changing i to a pointer and doing *i + 5 (no return). This way the return doesn't have to be dup'd. – Suroot Mar 29 '11 at 22:06
  • "since both elements already exist in the register": for some reason MSVC doesn't use this useful assumption. It copies them from registers to the stack inside the function and uses them from memory, making passing via registers useless. – Andrey Mar 30 '11 at 10:38