
I am trying to measure FLOPS in a rather simple way:

clock_gettime(CLOCK_REALTIME, &start);
num1 + num2;
clock_gettime(CLOCK_REALTIME, &end);
ns += end.tv_nsec - start.tv_nsec;

I run this in a loop and then compute how many nanoseconds, on average, this operation takes.

I am obtaining results I was not expecting given the published performance numbers for my CPU.

After further reading, my guess is that I am erroneously equating the C statement that adds two floating-point numbers with a single floating-point operation (FLOP).

My question is: how exactly are FLOPS measured? Are they based purely on properties of the CPU, such as its frequency?

My complete code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    if (argc != 2) return -1;
    int n = atoi(argv[1]);
    if (n <= 0) return -1;

    float num1;
    float num2;
    struct timespec start, end, res;
    float ns = 0;
    clock_getres(CLOCK_REALTIME, &res);
    fprintf(stderr, "CLOCK resolution: %ld nanosecond(s).\n", res.tv_nsec);

    for (int i = 0; i < n; i++) {
        num1 = ((float)rand()/(float)(RAND_MAX));
        num2 = ((float)rand()/(float)(RAND_MAX));
        clock_gettime(CLOCK_REALTIME, &start);
        num1 + num2;
        clock_gettime(CLOCK_REALTIME, &end);
        ns += end.tv_nsec - start.tv_nsec;
    }
    fprintf(stderr, "Average time per operation: %.4f\n", ns/n);
} 
    You cannot measure a single operation this way. The time of a single operation is tiny compared to the time to call the clock routines and the resolution of the measurements of time, and there are possibilities of interrupts and other factors affecting the measurement. The time of one operation will be lost in the noise. Also, optimization by the compiler may move the operation outside the `clock_gettime` calls or remove it entirely since it has no observable behavior. There must be other questions on measuring performance, so look for those. – Eric Postpischil Mar 09 '21 at 10:46
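For illustration, a minimal sketch of the approach that comment points toward: one pair of `clock_gettime` calls around a loop of n additions, divided by n afterwards. The dependent accumulator and the final `fprintf` are assumptions made here to keep the additions observable so the optimizer cannot delete them, and the build must not use `-ffast-math`, or the compiler may collapse the chain into a single multiply. Note this measures the latency of a chain of dependent adds, not the peak throughput the published numbers describe.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char *argv[])
{
    if (argc != 2) return -1;
    long n = atol(argv[1]);
    if (n <= 0) return -1;

    float step = (float)rand() / (float)RAND_MAX;
    float acc = 0.0f;
    struct timespec start, end;

    /* One measurement around the whole loop instead of one per operation. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < n; i++)
        acc += step;                /* each add depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Account for tv_sec as well; tv_nsec alone wraps at one second. */
    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (double)(end.tv_nsec - start.tv_nsec);

    /* Printing acc gives the additions observable behavior, so the
       optimizer cannot delete the loop. */
    fprintf(stderr, "acc = %f\n", acc);
    fprintf(stderr, "Average time per add: %.4f ns\n", ns / (double)n);
    return 0;
}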
You can check this [repository](https://github.com/brianolson/flops) on GitHub to see some examples. Besides the problems with compiler optimization, your code will be preempted by the process scheduler, which will give you different results each time you re-run. – jordanvrtanoski Mar 09 '21 at 11:45
@EricPostpischil gotcha. I had the suspicion my routine was too naive. Decided to post the question because, even though there is a lot of information about the topic, I am having trouble finding actual examples that I can "manipulate", so to speak. – jregalad Mar 09 '21 at 12:44
  • @jordanvrtanoski Thanks!! Will check it out. – jregalad Mar 09 '21 at 12:46
    In addition to what was already mentioned. You're not using the result of `num1 + num2` anywhere, so it will be deleted by the compiler. Regarding FLOP numbers - published numbers are usually calculated for vector FMA (multiply+add) instructions and multiplied by the number of cores. See https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle – stepan Apr 25 '21 at 20:20
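To make that comment concrete: the peak numbers in datasheets come from arithmetic of this shape, not from timing a single statement. A worked example under assumed specs (a hypothetical 4-core CPU at 3.5 GHz with two 8-wide single-precision FMA units per core; one FMA counts as 2 FLOPs, one multiply plus one add):

    FLOPs per cycle per core = 2 FMA units × 8 lanes × 2 FLOPs/FMA = 32
    peak = 4 cores × 3.5e9 cycles/s × 32 FLOPs/cycle ≈ 448 GFLOPS

A single scalar `float` addition per iteration can never approach such a figure, which is why the measured average looks nothing like the published number.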

0 Answers