I've been tasked with writing a benchmark program in C that estimates the MIPS of an x86 system. My approach is to run an empty for loop for a large number of iterations and measure its execution time to determine the MIPS. To do that, however, I need to know how many instructions a single iteration of the for loop executes.

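#include <stdio.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    size_t max_iterations = 1000000000;
    // placeholder: instructions per loop iteration, the unknown I'm asking about
    double instruction_per_cycle = 0.0;
    struct timeval start, end;

    // grab start time
    gettimeofday(&start, NULL);

    for(size_t i = 0; i < max_iterations; i++)
    {
        // empty
    }

    // grab end time and calculate MIPS
    gettimeofday(&end, NULL);
    double elapsed_sec = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1000000.0;

    printf("MIPS = %f\n", max_iterations * instruction_per_cycle / 1000000.0 / elapsed_sec);

    return 0;
}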

I'm unfamiliar with the x86 instruction set. However, for the for loop I've provided, it seems like the following could be instructions:

  1. load value i from memory to register
  2. load value max_iterations from memory to register
  3. perform comparison between i and max_iterations
  4. increment i
  5. write new value of i to memory
  6. jump into the loop body, assuming the comparison holds
  7. jump back to start of loop statement
Peter Cordes
Izzo
  • It's zero if you compile with optimizations on. And benchmarking these things is pointless – phuclv Sep 25 '21 at 08:00
  • *My approach is run an empty for loop for a large amount of iterations.* That will seriously under-estimate the actual instructions-per-cycle modern CPUs are capable of. If the loop counter stays in memory (debug build), it will bottleneck on that dependency chain, nowhere *near* the bandwidth of the rest of the CPU. Even in a tight loop that compiles to `top:` / `dec ecx` / `jnz top`, it'll run one cycle per iteration, but only 1 uop (macro-fused dec-and-branch), when the pipeline is 4 uops wide on Skylake for example. Or wider on Zen or Ice Lake. – Peter Cordes Sep 25 '21 at 10:11
  • Anyway, IPC (instructions per cycle) depends on the code being executed. The peak is about 5 (or 6) instructions on a pipeline that's 4 uops wide ([What is the maximum possible IPC can be achieved by Intel Nehalem Microarchitecture?](https://stackoverflow.com/a/37062887)), but real code like SPECint2017 usually runs at more like 1.7 on Haswell (https://www.researchgate.net/publication/322745869_A_Workload_Characterization_of_the_SPEC_CPU2017_Benchmark_Suite), with some bottlenecks from cache misses, when compiled with optimization by a modern compiler like GCC or clang. – Peter Cordes Sep 25 '21 at 10:22
  • Related: [How many asm-instructions per C-instruction?](https://stackoverflow.com/q/331428) (you can't guess from the source, you have to look). Also [What is the maximum possible IPC can be achieved by Intel Nehalem Microarchitecture?](https://stackoverflow.com/q/37041009) shows a loop that runs 5 instructions per clock cycle on most modern x86 CPUs, so MIPS = 5x frequency in that best-case scenario. – Peter Cordes Sep 25 '21 at 10:27

1 Answer

Here is what I did to view the disassembly, which should help you get what you need.

I wrote a simple function with a plain for loop in its body and saved it to a file, for.c:

void loop()
{
    
    for(int i = 0; i < 10; i++)
    {
        // empty
    }
}

Then I ran

gcc -S for.c

which asks gcc to emit assembly code; the result is written to for.s. I then ran as (the GNU assembler) to produce the object file for.o with the following command:

as -o for.o for.s

which generates the object file for.o. Finally, I asked the objdump utility to show me the disassembly of the object file with the following command:

 objdump -d for.o

which shows me an output like this...

for.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <loop>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
   b:   eb 04                   jmp    11 <loop+0x11>
   d:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  11:   83 7d fc 09             cmpl   $0x9,-0x4(%rbp)
  15:   7e f6                   jle    d <loop+0xd>
  17:   90                      nop
  18:   5d                      pop    %rbp
  19:   c3                      retq

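This listing also includes stack setup and teardown instructions, because I wrote the loop inside a function. The loop itself is just the three instructions at offsets d, 11 and 15 (addl, cmpl, jle), which is fewer than the whole disassembly suggests. Note that this is an unoptimized build; with optimizations on, the compiler can remove an empty loop entirely.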

I ran all of this on x86_64 and compiled with gcc, so pay attention to which tools and architecture you are using; the output can differ.

There may be other ways to achieve the same result, but this is the approach I can suggest for now.

Nalin Ranjan