
By "fp operation" I mean "floating point operation". I'm working on a Linux box. Is there a system call that returns this value as a static metric or can you test this with an algorithm in C/C++/some other language?

edit: I should have mentioned that this isn't to test the efficiency of my code. I was asked during an interview how long a theoretical algorithm would take to run. I had to figure out how many FLOPs would be executed and multiply that by the time each operation would take to come up with a rough estimate. I just thought it was an interesting question.

kjh
  • The "amount of time" that it takes to perform an operation is not (by itself) a very useful metric. Sure, it takes 9 months for a woman to produce a baby, but a city has many women and not all of them are necessarily able or willing to have a kid on demand. – Mysticial Jul 30 '14 at 23:56
  • What everyone else said (as answer or comment), plus the linux kernel is still lacking a pre-cog feature. You may badger Linus to write one, or, if you're good, write it yourself ... – tink Jul 31 '14 at 00:04
  • What floating op in particular? If you're interested in something that maps exactly to a single instruction, [Agner Fog's Instruction Tables](http://www.agner.org/optimize/) may be of some help. – Iwillnotexist Idonotexist Jul 31 '14 at 02:34
  • @tink what exactly is a pre-cog feature? This sounds like something that might be fun – kjh Aug 05 '14 at 18:55
  • possible duplicate: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) / [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) - there is no single cost number you can add for instructions up to get performance, that's not how pipelined CPUs work, especially not with typical performance numbers like latency = 4 cycles, throughput = 2/cycle for FP math instructions on Skylake. It critically depends on dependency chains. – Peter Cordes Nov 16 '22 at 14:56

5 Answers


This is almost certainly not a useful metric. Many, many other things go into code efficiency -- especially cache hits/misses.

With that being said, there's a ServerFault thread on this topic that links to an Intel benchmarking suite that you can use, but you should be aware that you will probably not see any correlation between your system's maximum FLOPS and the performance of your application.

Patrick Collins

What you want to use is Agner Fog's software "Test programs for measuring clock cycles and performance monitoring". It's what he used to measure the latency and throughput of instructions to produce his famous instruction tables. It is well documented and includes device drivers (with instructions on how to install them) that you can use in your own code. This is especially useful because, in order to measure certain quantities such as the real CPU frequency, you need access to model specific registers (MSRs), which user level code does not normally have access to.

Edit: based on your edit (estimating, for an interview question, how long a floating point intensive algorithm will take to run) you can use this formula:

time = number_of_floating_point_operations / (efficiency * peak_flops).

The peak flops per core for many processors can be found in this link: *FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2*.

Each of these quantities may be difficult to calculate/estimate, but the efficiency is the most difficult, since it can depend on many things, such as:

  1. Is the algorithm compute or memory bound?
  2. How well can the algorithm use SIMD (e.g. SSE/AVX/FMA)?
  3. How well can the algorithm use MIMD (e.g. multiple cores)?
  4. How well does your implementation use the different cache levels?

To make this clearer, let's consider two algorithms: matrix multiplication and the dot product. It's easy to calculate the number of floating point operations for these two algorithms. The number of floating point operations for matrix multiplication is 2*n*n*n. For the dot product it's 2*n.

Matrix multiplication, if done right, is compute bound and can fully benefit from SIMD and MIMD. Its efficiency will start low for small n and plateau for large n. My own implementation gets up to 75%. The Intel MKL gets over 95% (but less than 90% using FMA).

So a back-of-the-envelope estimate of the time for matrix multiplication for large n is to assume 100% efficiency, giving time = 2*n^3/peak_flops.

However, for the dot product the efficiency will start high for small n and drop to a plateau for large n. That's because it's memory bound. So for large n the efficiency is determined by how fast memory can be read. For a modern machine that's about 10 GB/s. Since a modern desktop with four cores will have a peak flops of over 100 GFLOPS, and floats are 4 or 8 bytes, I would estimate the efficiency for large n to be close to 1%, giving time = 2*n/(0.01*peak_flops). I went ahead and tested this (see the code below). I get about 2.2 GFLOPS on my system, which has a peak of 236 GFLOPS, so that's about 1% of the peak. The bandwidth of my system is about 11 GB/s.

Most algorithms are memory bound, so knowing how fast your system can read memory (DDR3, DDR4, ...) is one of the most useful metrics for estimating the time.

So in general, if you know the number of floating point operations of an algorithm and the peak flops of your processor, the first thing you should ask is whether the algorithm is compute bound or memory bound for large n. Then, for a back-of-the-envelope estimate of the time, I would assume 100% efficiency for the compute bound case; for the memory bound case I would look up the bandwidth to estimate the efficiency.

This code estimates the data rate and GFLOPS from the dot product.

    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>

    float dot(float *x, float *y, int n) {
        float sum = 0;
        for(int i=0; i<n; i++) {
            sum += x[i]*y[i];
        }
        return sum;
    }

    int main(){
        const int LEN = 1 << 28;
        float *x = new float[LEN];
        float *y = new float[LEN];
        for(int i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5;}

        size_t size = 2*sizeof(float)*LEN;  // bytes streamed by the dot product

        clock_t time0 = clock();
        float sum = dot(x,y,LEN);
        clock_t time1 = clock();
        double dtime = (double)(time1 - time0) / CLOCKS_PER_SEC;
        double rate  = 1.0*size/dtime*1E-9;  // GB/s
        double flops = 2.0*LEN/dtime*1E-9;   // GFLOPS
        printf("sum %f, dtime %f, rate %f, flops %f\n", sum, dtime, rate, flops);
        delete [] x;
        delete [] y;
        return 0;
    }
Z boson

It doesn't make much sense to try to determine the time taken by a FLOP "in a vacuum", since there are lots of other factors influencing it (whether the operands are in memory/cache/registers, what type of operation it actually is, whether the compiler is emitting x87/SSE/SSE2/... instructions, whether it involves "strange" IEEE 754 values, whether the processor pipelines are used efficiently, whether the code is branch-predictor friendly, ...).

You should instead use a profiler on the actual code of your algorithm, to see what the real bottlenecks are and how much time is actually spent on them in your specific code.

Matteo Italia

There is a quick 'n' dirty method: take timestamps before and after doing some operations and subtract them to see how much time was consumed. However, this is not very accurate.

Die Usche

The following provides an idea of the variability, from one of my benchmarks that uses long sequences of the same assembly code instructions, attempting to fill up the pipelines; different tests use a variable number of registers. This is just for addition.

  Intel(R) Core(TM) i7-4820K CPU running at near 3.9 GHz

 Speeds adding to     1 Register  2 Registers  3 Registers  4 Registers

 32 bit Integer MIPS     4303         8553        11997        12294
 32 bit Float MFLOPS     1304         2608         3866         3866 SP
 64 bit Float MFLOPS     1304         2608         3866         3865 DP
 32 bit MMX Int MIPS     7824        14902        14936        14902
 32 bit SSE MFLOPS       5215        10431        15464        15463 SP
 64 bit SSE2 MFLOPS      2608         5216         7732         7731 DP

 32 bit SSE2 Int MIPS   15647        29803        29872        29803
 64 bit SSE2 Int MIPS    7823        14902        14936        14902
Roy Longbottom