How do I analyze performance of cpp / assembly code

Question

I'm trying to learn more about how to analyze the performance of my more frequently used methods.

I've tried using rand() and timing large numbers of calls to my methods as a method of performance measurement but I also want to learn more about how to measure performance by understanding what the assembly code is doing.

For example, I have read about people trying to optimize the sgn function (Is there a standard sign function (signum, sgn) in C/C++?) so i figured this would be a good place to start. I went to http://gcc.godbolt.org and generated asm for the following code (ICC with -march=core-avx2 -fverbose-asm -Ofast -std=c++11):

int sgn_v1(float val)
{
    return (float(0) < val) - (val < float(0));
}

and

int sgn_v2(float val)
{
  if (float(0) < val)      return  1;
  else if (val < float(0)) return -1;
  else                     return  0;
}

This generated the following assembly

L__routine_start__Z6sgn_v1f_0:
sgn_v1(float):
        vxorps    %xmm2, %xmm2, %xmm2                           #3.38
        vcmpgtss  %xmm2, %xmm0, %xmm1                           #3.38
        vcmpgtss  %xmm0, %xmm2, %xmm3                           #3.38
        vmovd     %xmm1, %eax                                   #3.38
        vmovd     %xmm3, %edx                                   #3.38
        negl      %eax                                          #3.38
        negl      %edx                                          #3.38
        subl      %edx, %eax                                    #3.38
        ret                                                     #3.38

and

L__routine_start__Z6sgn_v2f_1:
sgn_v2(float):
        vxorps    %xmm1, %xmm1, %xmm1                           #8.3
        vcomiss   %xmm1, %xmm0                                  #8.18
        ja        ..B2.3        # Prob 28%                      #8.18
        vcmpgtss  %xmm0, %xmm1, %xmm0                           #
        vmovd     %xmm0, %eax                                   #
        ret                                                     #
..B2.3:                         # Preds ..B2.1
        movl      $1, %eax                                      #9.12
        ret                                                     #9.12

My analysis starts off with the fact that sgn_v1 has 9 instructions and sgn_v2 has 6 or 5 instructions depending on the results of the jump. The previous post talks about how sgn_v1 is branchless and it seems that this is a good thing, I assume this means that multiple instructions in sgn_v1 can be executed at the same time. I went to http://www.agner.org/optimize/instruction_tables.pdf and I couldn't fund most of these operations in the haswell section (p187-p202).

How can I analyze this?

Edit:

Responding to @Raxvan's comments, I ran the following test program

extern "C" int sgn_v1(float);
__asm__(
"sgn_v1:\n"
"  vxorps    %xmm2, %xmm2, %xmm2\n"
"  vcmpgtss  %xmm2, %xmm0, %xmm1\n"
"  vcmpgtss  %xmm0, %xmm2, %xmm3\n"
"  vmovd     %xmm1, %eax\n"
"  vmovd     %xmm3, %edx\n"
"  negl      %eax\n"
"  negl      %edx\n"
"  subl      %edx, %eax\n"
"  ret\n"
);

extern "C" int sgn_v2(float);
__asm__(
"sgn_v2:\n"
"  vxorps    %xmm1, %xmm1, %xmm1\n"
"  vcomiss   %xmm1, %xmm0\n"
"  ja        ..B2.3\n"
"  vcmpgtss  %xmm0, %xmm1, %xmm0\n"
"  vmovd     %xmm0, %eax\n"
"  ret\n"
"  ..B2.3:\n"
"  movl      $1, %eax\n"
"  ret\n"
);

#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
  size_t N = 50000000;
  std::clock_t start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v1(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v1 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v2(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v2 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v2(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v2 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v1(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v1 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;
}

And I got the following result:

g++-4.8 -std=c++11 test.cpp && ./a.out
v1 Time: 423.81 ms
v2 Time: 657.226 ms
v2 Time: 666.233 ms
v1 Time: 436.545 ms

So the branchless result is clearly better; @Jim's suggested I look into how branch predictors work but I still can't find a way of computing how "full" the pipeline is...

simple way: write some code that uses that function lots of times, and measure the time difference. That should give you a ratio between first and second function. Another way is to use a profiler like Intel Vtune (or the one builtin Visual studio 2013). One thing to note is that this might give you different results **based on the compiler you are using, optimization and cpu architecture** — Raxvan, Sep 18 '14 at 16:15
Branchless means that there are no jump instructions. This is good because if the CPU is pipelined it can execute multiple instructions at once. If there are branches the CPU might have to discard some of the work it does in the pipelines if it turns out that its branch prediction was incorrect. (http://en.wikipedia.org/wiki/Branch_predictor) — Jim, Sep 18 '14 at 17:52
Using rand() in a benchmark is troublesome, real data isn't random. Something the processor counts on heavily, branch prediction is a big deal. You really don't have anything but an artificial result. Probably as good as it is going to get. — Hans Passant, Sep 18 '14 at 23:43
This static analyzer looks pretty useful: http://stackoverflow.com/a/26021338/120163 — Ira Baxter, Sep 25 '14 at 16:29
Be careful not to be like the drunk looking for his keys under the street lamp because that's where the light is. You thought of something, and then you want to make it fast. The real secret to making programs fast is to admit that you don't know where to look, and you *won't know* until the program tells you. You need to find out how to listen to the actual particular program. [*Here's an example.*](http://stackoverflow.com/a/927773/23771) — Mike Dunlavey, Sep 25 '14 at 17:40

score 0 · Accepted Answer · answered Sep 25 '14 at 15:55

In general time is a pretty noisy measurement especially when you are measuring things sequentially into a single run/process, that means one after the other with interleaving events that may add noise. As you mention branches have a major effect on the pipeline and as a rule of thumb code with less branches should perform better, in general the two main things that play role in performance are locality of reference and branch prediction while in more complicated cases such as when using multi-threading there are additional factors. To answer your question I would say that it is better to use tools such as perf for example that can indicate the number of cache misses and branch miss predictions which should give a good indication, in general depending on the platform you are developing on you may be able to find an appropriate tool that may query the performance counters of the CPU. Also you should really generate a set of random values and use the exact same ones with both functions so that you dismiss the noise from the execution of std::rand(). Finally bear in mind that code would perform differently depending on different compilers, compile options (obviously) and target architectures however some of the logic you can apply should persist regardless, as in your example the code without conditionals-branches should pretty much always perform better. If you want to get really picky about it you should really read the manuals of Intel (particularly for avx).

How do I analyze performance of cpp / assembly code

1 Answers1