Is there a tool or method which tells me how many clock cycles a code block uses? Debugging and counting by hand is a pain for larger code blocks.
On a modern processor (e.g. modern x86), this isn't normally a meaningful/useful statistic (due to out-of-order execution, memory stalls, instruction caching, branch prediction, etc.) – Oliver Charlesworth Mar 28 '16 at 13:24
1 Answer
On x86, Intel's IACA (Intel Architecture Code Analyzer) is the only static analyzer I know of. It assumes zero cache misses, and various other simplifications, but it is somewhat useful.
I think it also assumes all but the last branch are not-taken, so it's probably not useful for a loop body with taken branches.
IACA also has some mistakes in its data, e.g. it thinks shld is slow on Sandybridge. It does know about some non-obvious things, like the fact that SnB-family CPUs can't micro-fuse 2-register addressing modes (so e.g. paddd xmm0, [rdi+rcx*4] costs an extra fused-domain uop compared to paddd xmm0, [rdi]).
It's essentially been abandoned since the update for Haswell. Skylake can run some instructions on more execution ports than Haswell (see Agner Fog's instruction tables), but the pipeline is similar enough that the results should be fairly useful. See also other links at the x86 tag wiki, including Intel's optimization manuals, to help you make sense of the output.
I like to use this iaca.sh wrapper script to make -64 the default (which I can override with -32). I forget how much of this I wrote (maybe just the if (($# >= 1)) bit at the end) and where the LD_LIBRARY_PATH part came from.

iaca.sh:
#!/bin/bash
myname=$(realpath "$0")
mypath=$(dirname "$myname")
ld_lib="$LD_LIBRARY_PATH"
app_loc="../lib"

if [ "$LD_LIBRARY_PATH" = "" ]; then
    export LD_LIBRARY_PATH="$mypath/$app_loc"
else
    export LD_LIBRARY_PATH="$mypath/$app_loc:$LD_LIBRARY_PATH"
fi

if (($# >= 1)); then
    exec "$mypath/iaca" -64 "$@"
else
    exec "$mypath/iaca"   # there is no -help; just run with no args for the help output
fi
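With iaca.sh in IACA's bin directory (next to the iaca binary, so the ../lib relative path resolves), usage looks like this; the object-file names are just placeholders:

$ ./iaca.sh prefix-sum.o          # runs: iaca -64 prefix-sum.o
$ ./iaca.sh -32 some-32bit.o      # the later -32 overrides the script's default -64
$ ./iaca.sh                       # no args: bare iaca, which prints its usage text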
Example: in-place prefix sums, from SIMD prefix sum on Intel CPU:
#include <immintrin.h>
#ifdef IACA_MARKS_OFF
#define IACA_START
#define IACA_END
#else
#include <iacaMarks.h>
#endif
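// (A note from memory of IACA 2.x's iacaMarks.h, so treat it as approximate:
// IACA_START / IACA_END expand to inline asm that loads a magic number into
// ebx (111 for start, 222 for end) followed by the bytes 0x64, 0x67, 0x90.
// iaca scans the object file for those byte sequences to find the region to
// analyze. A binary built with the marks generally can't run correctly, since
// ebx is clobbered behind the compiler's back; that's what the IACA_MARKS_OFF
// escape hatch above is for.)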
// In-place rewrite an array of values into an array of prefix sums.
// This makes the code simpler, and minimizes cache effects.
int prefix_sum_sse(int data[], int n)
{
    // const int elemsz = sizeof(data[0]);
#define elemsz sizeof(data[0])  // clang-3.5 doesn't allow const int foo = ... as an imm8 arg to intrinsics
    __m128i *datavec = (__m128i*)data;
    const int vec_elems = sizeof(*datavec) / elemsz;
    // to use this for int8/16_t, you still need to change the add_epi32, and the shuffle
    const __m128i *endp = (__m128i*)(data + n - 2*vec_elems);  // last place we can load a full pair of vectors
    __m128i carry = _mm_setzero_si128();
    for (; datavec <= endp; datavec += 2) {
        IACA_START
        __m128i x0 = _mm_load_si128(datavec + 0);
        __m128i x1 = _mm_load_si128(datavec + 1);  // unroll / pipeline by 1
        // __m128i x2 = _mm_load_si128(datavec + 2);
        // __m128i x3;
        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz));
        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz));
        // more shifting if vec_elems is larger
        x0 = _mm_add_epi32(x0, carry);  // this has to go after the byte-shifts, to avoid double-counting the carry
        _mm_store_si128(datavec + 0, x0);  // store first to allow destructive shuffle (e.g. non-AVX shufps for FP or pshufb for narrow integers)
        x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1);
        _mm_store_si128(datavec + 1, x1);
        carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3));  // broadcast the high element for the next vector
    }
    // FIXME: scalar loop to handle the last few elements
    IACA_END
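    // One way to fill in that FIXME (my sketch, untested, not from the
    // original answer): carry holds the running total broadcast to all four
    // lanes, and datavec points just past the last pair of vectors stored,
    // so a scalar loop can finish the tail.
    int csum = _mm_cvtsi128_si32(carry);
    for (int *p = (int*)datavec; p < data + n; ++p) {
        csum += *p;     // accumulate the prefix sum one element at a time
        *p = csum;
    }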
    return data[n-1];
#undef elemsz
}
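To build a binary you can actually run, define the escape-hatch macro from the top of the file so the marks expand to nothing, e.g. gcc -DIACA_MARKS_OFF -Wall -O3 prefix-sum.c plus whatever test harness you link against. For the IACA analysis itself, compile with the marks and point iaca.sh at the object file: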
$ gcc -I/opt/iaca-2.1/include -Wall -O3 -c prefix-sum.c -march=nehalem -mtune=haswell
$ iaca.sh prefix-sum.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - prefix-sum.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 6.40 Cycles       Throughput Bottleneck: Port5
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 5.7 | 1.4 1.0 | 1.4 1.0 | 2.0 | 6.3 | 1.0 | 1.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | movdqa xmm3, xmmword ptr [rax]
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | | | 1.0 1.0 | | | | | | movdqa xmm0, xmmword ptr [rax-0x10]
| 0* | | | | | | | | | | movdqa xmm1, xmm3
| 1 | | | | | | 1.0 | | | CP | pslldq xmm1, 0x4
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm3
| 0* | | | | | | | | | | movdqa xmm3, xmm0
| 1 | | | | | | 1.0 | | | CP | pslldq xmm3, 0x4
| 0* | | | | | | | | | | movdqa xmm4, xmm1
| 1 | | 1.0 | | | | | | | | paddd xmm3, xmm0
| 1 | | | | | | 1.0 | | | CP | pslldq xmm4, 0x8
| 0* | | | | | | | | | | movdqa xmm0, xmm3
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm4
| 1 | | | | | | 1.0 | | | CP | pslldq xmm0, 0x8
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm2
| 1 | | 0.8 | | | | 0.2 | | | CP | paddd xmm0, xmm3
| 2^ | | | | | 1.0 | | | 1.0 | | movaps xmmword ptr [rax-0x20], xmm1
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm1, 0xff
| 1 | | 0.9 | | | | 0.1 | | | CP | paddd xmm0, xmm1
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | | movaps xmmword ptr [rax-0x10], xmm0
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xff
| 0* | | | | | | | | | | movdqa xmm2, xmm1
| 1 | | | | | | | 1.0 | | | cmp rdx, rax
| 0F | | | | | | | | | | jnb 0xffffffffffffff94
Total Num Of Uops: 20
Note that the total uop count is not the fused-domain uop count that matters for the frontend, ROB, and 4-wide issue/retire width; it counts unfused-domain uops, which matter for the execution units (and scheduler). (E.g. the micro-fused store shows up as 2^: one fused-domain uop for the frontend, but separate store-address and store-data uops for the execution units.) It's kind of silly, though, because in the unfused domain it mostly matters which port a uop needs, not how many uops there are.
This isn't the best example because it's trivially bottlenecked on the shuffle port in Haswell. It does show how IACA displays mov-elimination, micro-fused stores, and macro-fused compare-and-branch, though.
The distribution of uops between ports when there's a choice is fairly arbitrary. Don't expect it to match what real hardware does. I don't think IACA models the ROB/scheduler at all, really. This and other limitations have been discussed in previous SO questions. Try searching on IACA, since it's a fairly unique string.
