
I want to get an idea of how fast the modulus operator (%) runs. I have set up a simple program to benchmark % applied to randomly generated values, measuring the time in nanoseconds with a high-resolution clock. It often reports that 0 ns elapsed. Obviously nothing happens instantaneously, so why would this be? If I increase the number of rounds to about 50,000, it usually takes about 1,000,000 ns, but even 5000 rounds always gives 0 ns. Am I measuring it wrong? What optimization is being done to allow for this?

#include <iostream>
#include <chrono>
#include <random>

void runTest(const int rounds, const int min, const int max);

int main()
{
    std::cout << "started" << std::endl;
    runTest(5000, 1000000, 2000000);

    return 0;
}



/*IN: number of rounds to run on the test, the min and max value to choose between for operands to mod
OUT: time taken (in nanoseconds) to complete each operation on the same randomly generated numbers*/
void runTest(const int rounds, const int min, const int max)
{
    std::random_device rd;     // only used once to initialise (seed) engine
    std::mt19937 rng(rd());    // random-number engine used (Mersenne-Twister in this case)
    std::uniform_int_distribution<int> uni(min,max); // guaranteed unbiased

    std::chrono::nanoseconds durationNormalMod = std::chrono::nanoseconds::zero();
    std::chrono::nanoseconds durationFastMod = std::chrono::nanoseconds::zero();

    long long result = 0;

    for(auto i = 0; i < rounds; i++)
    {
        const int leftOperand = uni(rng);
        const int rightOperand = uni(rng);
        auto t1 = std::chrono::high_resolution_clock::now();
        long long x = (leftOperand % rightOperand);
        auto t2 = std::chrono::high_resolution_clock::now();
        //std::cout << "x: " << x << std::endl;
        result += x;
        durationNormalMod += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    }

    std::cout << "duration of %: " << durationNormalMod.count() << std::endl;
    std::cout << "result: " << result << std::endl;//preventing optimization by using result
}

I compile with `g++ prog.cpp -o prog.exe -O3`.

I'm interested because I have a specific case where I can implement modulus using a different algorithm, and I'm curious whether it would be faster.

northerner
  • You're not doing anything with the result `x` - it's possible the calculation was left out by the optimizer (check in the generated assembly). – Sander De Dycker Apr 16 '20 at 10:20
  • 1
    Also : 5000 iterations doesn't sound like a lot. – Sander De Dycker Apr 16 '20 at 10:21
  • It's very likely that the modulo operation translates to a single instruction on your platform. So it's unlikely your different algorithm will do better. – Sander De Dycker Apr 16 '20 at 10:23
  • @SanderDeDycker what exactly would I be checking in the assembly? – northerner Apr 16 '20 at 10:24
  • Is there a way to ensure something is not optimized away, like by using `volatile`? – northerner Apr 16 '20 at 10:26
  • you'd look for whatever instruction(s) correspond to the modulo calculation - you'll want to read your platform and/or compiler documentation for details on your specific flavor of assembly and notation. An easy way to ensure it's not optimized away is to *use* it (e.g. print all values, or an aggregated value, after your loop). You'll obviously also want to take the advice from the answers below, to measure the time the entire loop takes rather than a single iteration. – Sander De Dycker Apr 16 '20 at 10:35
  • Thanks I did, and like I said it was being used. – northerner Apr 16 '20 at 10:36
  • If you have a special case that you suspect that you could do faster, you should measure that special case. It's entirely possible that the compiler does better - they are very good at special cases. – molbdnilo Apr 16 '20 at 11:13
  • To ensure something is not optimized away: `-O0`. But then benchmarking is worthless. – Eljay Apr 16 '20 at 13:20
  • What is the resolution of the high resolution clock on your system? Your jump from 0 to 1,000,000 looks, to me, like the 0 fell below the minimum measurable time (and that you should be measuring milliseconds instead, maybe microseconds). – JaMiT Apr 16 '20 at 14:24
  • Microbenchmarking is hard: [Idiomatic way of performance evaluation?](https://stackoverflow.com/a/60293070) – Peter Cordes Apr 16 '20 at 14:58

3 Answers


When benchmarking it is important to:

  1. Use the computed result in some way. Any result that is never used, together with the computation that led to it, can and will be removed by an optimizing compiler.

  2. Time not a single computation but a batch of computations, such that the elapsed time between clock accesses is on the order of milliseconds; clock access is a relatively costly operation and will significantly skew the timing of very fast computations (see the sketch after this list).

  3. Disable CPU clock-frequency scaling, or at least warm up the CPU before measuring.
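For illustration, here is a minimal sketch that applies all three points to the same % workload (this is not the asker's exact program; the iteration count, the use of steady_clock, and the warm-up pass are illustrative choices):

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> uni(1000000, 2000000);

    // Enough work that the timed region lasts milliseconds (point 2).
    const int rounds = 1000000;
    std::vector<int> lhs(rounds), rhs(rounds);
    for (int i = 0; i < rounds; ++i) { lhs[i] = uni(rng); rhs[i] = uni(rng); }

    // Crude warm-up pass (point 3): run the same work once, untimed.
    long long warmup = 0;
    for (int i = 0; i < rounds; ++i) warmup += lhs[i] % rhs[i];

    // Time the whole loop with two clock reads, not each iteration.
    const auto t1 = std::chrono::steady_clock::now();
    long long result = 0;
    for (int i = 0; i < rounds; ++i) result += lhs[i] % rhs[i];
    const auto t2 = std::chrono::steady_clock::now();

    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    // Printing the accumulated results uses them (point 1), so the
    // compiler cannot discard the modulus operations.
    std::cout << "total: " << ns.count() << " ns\n"
              << "per op: " << static_cast<double>(ns.count()) / rounds << " ns\n"
              << "result: " << (result + warmup) << '\n';
}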

rustyx

On most implementations, the clocks, even high_resolution_clock, do not have granularity fine enough to measure individual nanoseconds.

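To see this on a given machine, a small probe (an illustrative sketch, not part of the original program) can compare the tick period the clock advertises with the smallest step it actually reports:

#include <chrono>
#include <iostream>

int main()
{
    using hrc = std::chrono::high_resolution_clock;

    // The tick period the implementation advertises at compile time.
    std::cout << "advertised period: "
              << hrc::period::num << '/' << hrc::period::den << " s\n";

    // The smallest step observable at run time: spin until now() changes.
    const auto t1 = hrc::now();
    auto t2 = t1;
    while (t2 == t1) t2 = hrc::now();
    std::cout << "observed step: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << " ns\n";
}

On many systems the advertised period is 1 ns while the observed step is tens of nanoseconds or more.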
You must perform multiple operations in a row and divide the total time by the number of repetitions. You already have a loop: move the time measurement outside the loop, and increase the number of repetitions.

It's a bit tricky, though, since the loop can sometimes lead the compiler to emit vector operations instead of the intended trivial repetition.

You must also use the result of the calculation, or the optimiser will simply not perform it.

eerorika

Optimizations are performed under the so-called "as-if rule": the compiler is allowed to apply any transformation to your code as long as the observable behavior stays the same. Since you do not use the result of the calculations, there is no observable behavior that would change if the calculations were not done at all. Measured time does not count as observable behavior.
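For example, the following hypothetical pair of functions (the names are illustrative; inspect the output of `g++ -O3 -S`, as the exact assembly depends on your compiler) shows the rule in action:

// With the result unused, the as-if rule lets the optimizer drop the
// division entirely; g++ -O3 typically compiles this to a bare ret.
void unused_mod(int a, int b)
{
    int x = a % b;  // no observable behavior depends on x
    (void)x;        // only silences the unused-variable warning
}

// Accesses to volatile objects are observable behavior, so here the
// division must actually be performed.
void used_mod(int a, int b)
{
    volatile int sink = a % b;  // observable store forces the computation
    (void)sink;                 // volatile read, also observable
}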

463035818_is_not_an_ai
  • I don't think this is what's happening. Why would running more iterations result in non-zero time taken if it was optimized away completely? Also in another version I did something simple with the result and it didn't make a difference. – northerner Apr 16 '20 at 10:29
  • 3
    @northerner Your time measurement time is in the loop, so you are measuring the time it takes to measure time and add measured zeroes together. Edit: actually, the addition is outside of measurement, but the overhead of measurement is still there. – eerorika Apr 16 '20 at 10:30
  • @northerner this is what **can** happen. `1,000,000ns` is probably small compared to the accuracy of your clock. Also, `50000` iterations is rather few. – 463035818_is_not_an_ai Apr 16 '20 at 10:30
  • @northerner and what eerorika said. The compiler can optimize away the calculations but not the time measurement, so you are basically adding up the noise of `50000` measurements. If you get ~100 ns of noise on each individual measurement (which wouldn't be bad), the sum is already 5,000,000 ns... – 463035818_is_not_an_ai Apr 16 '20 at 10:32
  • @northerner you can get a feeling for what is going on by removing all the code and measuring only the time of the empty loop. I wouldn't be too surprised if you got very similar results – 463035818_is_not_an_ai Apr 16 '20 at 11:08