
To the best of my understanding, on Intel x86 (e.g., the Ice Lake microarchitecture) I would expect AND on two integers to be faster than IDIV on the same two integers. However, when I write a program to actually measure the time, the difference is hard to spot.

To measure the time I use `time.h`, and the code is basically as follows:

#include <stdlib.h>   /* rand */
#include <time.h>     /* clock, CLOCKS_PER_SEC */

int integer_A = rand();
int integer_B = rand();
clock_t start = clock();
for (size_t i = 0; i < (1 << 26); i++)
    integer_A & integer_B;   /* or integer_A % integer_B to get IDIV */
clock_t end = clock();
double elapsed_time = (double)(end - start) / CLOCKS_PER_SEC;

How could I improve the measurement to get a result that demonstrates that AND is faster than IDIV (if that is the case)?

I understand that `time.h` measurements are imperfect. But what is the best I can do, within a program that anyone can run on their laptop, to show that AND is faster than IDIV?

  • If you want to measure asm instructions, why are you writing in C? (But yes, with a random integer, not a compile-time-constant, `int result = a % b` would use `idiv`.) The title question is a duplicate of [Assembly - How to score a CPU instruction by latency and throughput](https://stackoverflow.com/q/52260625), the question body is a near-duplicate of [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) and other more specific microbenchmark-methodology questions. – Peter Cordes Aug 10 '20 at 03:56
  • Also, you could look at the results of automated micro-benchmark tools that have tested individual instructions for latency, uop counts, and throughput, on a variety of microarchitectures including IceLake client: https://uops.info/. Ice Lake added a new integer division unit that's much faster for 64-bit integer division (which was previously *very* slow on Intel CPUs even for small numbers; [Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux](https://stackoverflow.com/a/52558274)). See also https://stackoverflow.com/tags/x86/info – Peter Cordes Aug 10 '20 at 04:02
  • You're now calling `rand()` *inside* the loop, so you're benchmarking `rand` (it can be even slower than division). Also, this still isn't a [mcve] because you're missing declarations on your variables. (Also necessary to `#include` the right headers for `CLOCKS_PER_SEC`.) Also, generally don't invalidate answers by editing your question. – Peter Cordes Aug 10 '20 at 06:10
  • @PeterCordes that is true. It's an issue. – foki Aug 10 '20 at 06:11
  • What's still an issue? Your benchmark works now, showing `&` is measurably faster than `%` when used this way with `int` runtime-variable numbers. After compiling with `gcc -O3` on x86-64 Arch Linux, `perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,arith.divider_active:u ./a.out` on i7-6700k Skylake shows that the integer divide unit was active for 536M cycles out of 3371M total cycles for the whole program. (https://godbolt.org/z/sbEfzf has source that compiles.) – Peter Cordes Aug 10 '20 at 06:17
  • (From the small diff in total time, 3371M cycles (865 ms) vs. 2992M cycles (768 ms), we can infer that most of the time was spent in `rand()`. Although really, superscalar out-of-order execution means that the CPU actually "spends" time in multiple places at once... But anyway, a proper asm benchmark like the linked duplicate shows could have measured AND throughput at 0.25 cycles and `idiv` throughput at 6 cycles, a perf ratio of 24x if that's the *only* operation you're doing.) – Peter Cordes Aug 10 '20 at 06:22
  • @PeterCordes How exactly do you convert millions of cycles to milliseconds? For example, `3371M cycles (865 ms)`. Do you use the frequency in `cycles:u`? – foki Aug 10 '20 at 14:46
  • The average frequency on that `perf stat` result line is cycles / task-clock. Don't convert; just look at task-clock to see how much CPU time your process used, if you want CPU time. But it makes more sense to measure in cycles for this purely ALU-bound task that doesn't have any cache misses: it's independent of CPU-frequency variation. – Peter Cordes Aug 10 '20 at 16:37
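
A compilable benchmark along the lines these comments converge on (`rand()` inside the loop for runtime-variable operands, results accumulated and printed so the compiler cannot discard the work) might look like the following sketch. It is not the actual source behind the Godbolt link, and, as noted above, `rand()` itself will dominate the runtime:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    unsigned sum = 0;
    clock_t start = clock();
    for (size_t i = 0; i < (1u << 26); i++) {
        int a = rand();
        int b = rand() | 1;   /* forcing the low bit guarantees a nonzero divisor */
        sum += a % b;         /* swap in `a & b` to time the AND version */
    }
    clock_t end = clock();
    printf("sum=%u, elapsed=%.3f s\n", sum,
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}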

1 Answer


Your measurement code is invalid. Any compiler that's not a complete joke will not emit any code for the for loop, since the result is unused. Even if it were used, the operation is the same on every iteration, so it would be hoisted out of the loop and performed only once, and the time spent on a single operation would be lost in measurement noise.

To measure this correctly, you need to

  1. create an obligation to actually perform a computation on each iteration
  2. ensure that the clock calls (or whatever other function you use to perform timing) are ordered with respect to the operations; otherwise the whole computation can just be reordered before/after both of them

One method that should work in principle to do this is using volatile objects, where each access should be considered by the compiler as having side effects. However, that will add constant memory loads/stores to the operation whose time you're measuring.
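
For example, here is a minimal sketch of the volatile approach (hypothetical variable names; the `| 1` just guarantees a nonzero divisor):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* volatile forces a load of a and b and a store to result on every
       iteration, so the loop cannot be optimized away; those memory
       accesses are extra work included in the measured time. */
    volatile int a = rand();
    volatile int b = rand() | 1;
    volatile int result;
    clock_t start = clock();
    for (size_t i = 0; i < (1u << 26); i++)
        result = a % b;   /* or: result = a & b; */
    clock_t end = clock();
    printf("result=%d, elapsed=%.3f s\n", result,
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}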

R.. GitHub STOP HELPING ICE
  • Thank you. I did not include all the code I'm running; it is now updated to better reflect it. What do you mean by the `clock` calls being ordered with respect to the operations? In the disassembled binary, the `clock@GLIBC_2.2.5` calls are before and after the loop. – foki Aug 10 '20 at 06:05
  • @foki Even in the updated code: why should the compiler evaluate the expression in the loop? Its result is not used. Make it so the result is used. – fuz Aug 10 '20 at 07:39
  • @fuz The changes are reversed. But let's say that I accumulate the outputs of the expression in the loop into a variable and return that variable. I have also disassembled the binary to make sure nothing significant is optimized out. I don't focus on the compiler; I just use it to generate assembly. Peter has tackled this in the comments of the original question. It seems that even then it is not trivial to quantify the difference. – foki Aug 10 '20 at 14:20
  • @foki If you accumulate the results into a variable, that can work. But beware: if it's always the same expression you add, the compiler will get rid of the loop. – fuz Aug 10 '20 at 14:49
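
To illustrate fuz's caveat, here is a hypothetical pair of functions (not code from the thread). In the first, the accumulated expression is loop-invariant, so the compiler is free to reduce the whole loop to a single division; in the second, one operand changes every iteration, so the division actually has to repeat:

#include <stddef.h>

/* Loop-invariant: a compiler may fold this into a single `a % b` and a
   multiply by n, eliminating the repeated division entirely. */
unsigned sum_invariant(int a, int b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a % b;
    return sum;
}

/* One operand varies each iteration, so the division must execute n times
   (assuming b != 0). */
unsigned sum_varying(int a, int b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a + (int)i) % b;
    return sum;
}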