19

Is there a quick/easy way to do this (for at least a rough estimate)?

I'm benchmarking algorithms and I thought it would be cool to know the absolute speed at which my computer is executing instructions and compare that with my asymptotic analysis.

ordinary
  • What kind of instructions did you have in mind (and why C/C++? Why not assembly?) – WhozCraig Aug 19 '13 at 07:21
  • generate an assembly file and count the instructions. Then do performance measurements of the same file. – duedl0r Aug 19 '13 at 07:22
  • If you're interested in "instructions per second", look at your clock speed, figure out the average number of cycles per instruction on your architecture, and do the division. Otherwise you need a more meaningful, less literal metric. – Chris Hayes Aug 19 '13 at 07:22
  • 11
    No, there's really not (a simple way). The problem is fairly simple: the number of instructions your computer can execute depends (heavily) on the mixture and order of those instructions. Getting a result that means anything typically involves the result from some well-known benchmark -- but meaningful benchmarks are usually fairly complex. Even then, you have to be careful -- the numbers often mean less than expected. – Jerry Coffin Aug 19 '13 at 07:22
  • ["how to achieve 4 FLOPs per cycle"](http://stackoverflow.com/a/8391601/1743220) covers reaching the peak performance and could be used as a performance-estimating benchmark. – Thomas B. Aug 19 '13 at 07:45
  • @JerryCoffin: not to mention the effect of cache. The closest to a "rough estimate" would be a known set of benchmarks such as Whetstone/Dhrystone. – DanielKO Aug 19 '13 at 07:58
  • 1
    @DanielKO The Whetstone benchmark was originally written in Algol, based on statistics gathered in about 1970 from the British National Physics Lab, but using only a four element array to test array access. It was then translated into Fortran. Dhrystone was based on similar principles, in different languages, written in Ada but translated into C. I thought Whetstone was outdated and unrealistic when I first studied it, about 30 years ago. – Patricia Shanahan Aug 19 '13 at 08:26
  • @PatriciaShanahan: Somebody has been browsing Wikipedia... Old or not, realistic or not, it's still present on modern CPU benchmark applications (a quick web search reveals SiSoft's Sandra as one example). You might be surprised, but after 30 years, 4-element arrays are still widely used; we even design whole architecture extensions around that. Being old and still in use qualifies as "known". – DanielKO Aug 19 '13 at 08:42
  • 2
    @DanielKO The point is that four element arrays and loops accessing them can be optimized in ways that are not applicable to larger arrays, and the four is the only array size in Whetstone. Although I checked the dates with Wikipedia, I was aware of the history of Whetstone long before Wikipedia existed. – Patricia Shanahan Aug 19 '13 at 08:48
  • 1
    'This 30 year old benchmark is still widely used, *and therefore it is the closest you can get to a "rough estimate"*'. Did I miss a vital piece of information that would make *that* conclusion seem logical? :) – jalf Aug 19 '13 at 09:17
  • 3
    If you take the original Dhrystone benchmark and compile it with gcc -O3 (a few year ago), you get fantasy numbers, because (at least) one of the loops gets made into nothing, and thus takes zero time. Instructions per second is number of instructons / time -> infinite number. But the total benchmar is not zero, time, so you end up with some fantasy number in the 100-1000x the theoretical of the processor. There are tricks you can do to make the compiler believe you need the code inside the loop, but that's not the original source any longer. Also, Drhystone is based on VAX instructions. – Mats Petersson Aug 19 '13 at 09:30
  • @jalf Popularity is not enough to make a benchmark any use, even as a "rough estimate". I would rather go with any of the benchmarks that use one or more realistic kernels, such as one of the [SPEC benchmarks](http://www.spec.org/benchmarks.html). Although obtaining the code costs money, the web site has results for many systems. Or even use something as simple as [linpack1000](http://www.netlib.org/benchmark/). Solving a system of 1000 linear equations is a meaningful task. The user's own code is the best benchmark of all for measuring instructions per second on the user's code. – Patricia Shanahan Aug 19 '13 at 15:00

5 Answers

27

If you want to know what your CPU can do, then look at the documentation. Your CPU vendor specifies the latency and throughput of all instructions, as well as a variety of other information (how many instructions can be issued or retired per cycle, cache latencies and much more). Based on this, you can compute the theoretical peak throughput.

If you want to know what your CPU is actually doing, then run your own code and measure its performance.

However, keep in mind that modern CPUs are really complex beasts, and their performance depends on a wide variety of factors. You will very rarely be able to come anywhere near maxing out your CPU, and understanding why, or what exactly is holding your code back, requires a fairly thorough understanding of the hardware. (My usual rule of thumb is that you're doing very well if you get a sustained 30-40% of the theoretical peak FLOPS.)
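
As a rough illustration (an addition, not part of the answer), here is a minimal sketch that times a simple multiply-add loop with `std::chrono` and compares the sustained FLOPS against an assumed peak figure; the peak used below is made up, so substitute the numbers from your own CPU's documentation:

```cpp
// Minimal sketch (not from the answer): time a simple multiply-add loop and
// compare the sustained FLOPS against an assumed theoretical peak.
// The peak below (3 GHz * 8 FLOPs/cycle on one core) is a made-up example;
// replace it with the figures from your own CPU's documentation.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const double assumed_peak_flops = 3.0e9 * 8.0;  // assumption, not a measured value

    const std::size_t n = 1 << 20;
    const int reps = 100;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            c[i] += a[i] * b[i];                     // 2 FLOPs per element
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double flops = 2.0 * n * reps / seconds;
    std::printf("sustained: %.2f GFLOPS (%.1f%% of assumed peak)\n",
                flops / 1e9, 100.0 * flops / assumed_peak_flops);

    // Print a checksum so the optimizer cannot remove the loop entirely.
    double sum = 0.0;
    for (float x : c) sum += x;
    std::printf("checksum: %f\n", sum);
}
```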

jalf
  • Or just let the kernel give you the [BogoMIPS](http://en.wikipedia.org/wiki/BogoMips). It's at least as useful as any theoretical (that is, disconnected from any context) estimation. – DanielKO Aug 19 '13 at 07:47
  • @DanielKO except "what the documentation says" is hardly "disconnected from any context". It is hard factual information about how your CPU works and what it is capable of, which is quite relevant if you are trying to make your code perform well. But yeah, if you just want a quick estimate which throws away a lot of nuances, then that number might be a very good candidate. – jalf Aug 19 '13 at 07:54
  • 1
    @jalf: instructions are not executed independently, so even a thorough description of how each instruction is executed reveals little of what is actually going to happen during execution. Cache miss, branch mispredictions, data dependencies, etc, this is part of the context I mentioned. – DanielKO Aug 19 '13 at 08:08
  • 2
    @DanielKO yes. None of that contradicts my answer in any way, does it? But if you want to know the maximum theoretical throughput your CPU is capable of given perfectly optimal code, then you assume that there are no cache misses, branch mispredicts or data dependencies. All of these can help account for why *your* code is much, much slower than this theoretical maximum, *which is the entire point*. – jalf Aug 19 '13 at 09:11
8

This is a typical case of "In theory, theory and practice are the same; in practice, they are not."

Modern CPUs have very sophisticated logic in them, which means that the ACTUAL number of operations performed is different from what you'd think just from looking at the code or thinking about the problem [unless you have a brain the size of a small planet and know how that particular CPU works]. For example, a processor may speculatively execute instructions on one or the other side of a branch, even if it hasn't quite got to the branch - if that's the "wrong" side, then it will discard the results of those instructions - but of course it took time to execute them.

Instructions are also executed out of order, which means that it's hard to predict exactly which instruction will execute when. There are some exceptions.

You will only get (anywhere near) the theoretical throughput if you are pushing data and instructions through all the available execution units at once - this means having the right mix of instructions, and of course ALL of the code and data in caches.

So, in theory we could stuff the processor full of instructions that maxes it out, by writing very clever code. In practice, that turns very very very quickly into a hard task.

However, the question is about measuring the throughput of instructions, and on modern CPUs this is very much possible with the right extra software. On Linux there are perf and oprofile; for Windows there are Intel's VTune and AMD's CodeAnalyst. These will allow you (subject to sufficient privileges) to read the "performance counters" in the processor, which include counters for "number of instructions", "number of floating point operations", "number of cache misses", "branches mispredicted" and many, many other measurements of processor performance. So given a sufficient length of runtime (at least a few seconds, preferably more), you can measure the actual number of instructions or clock cycles that the processor performs.
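
For completeness, here is a minimal Linux-only sketch (an addition, not from the original answer) that reads one such counter, retired instructions, directly through the perf_event_open system call, following the usage pattern documented in the perf_event_open(2) man page:

```cpp
// Minimal Linux-only sketch: read the hardware "instructions retired" counter
// around a piece of code via perf_event_open, following the usage pattern
// documented in the perf_event_open(2) man page.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // retired instructions
    attr.disabled = 1;
    attr.exclude_kernel = 1;                   // count user-space instructions only
    attr.exclude_hv = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);  // this process, any CPU
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long work = 0;
    for (long i = 0; i < 10000000; ++i) work += i;  // the code being measured

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    read(fd, &count, sizeof(count));
    std::printf("instructions retired: %lld\n", count);
    close(fd);
}
```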

Mats Petersson
  • 5
    "... that turns very very very quickly.", can we attach a dynamo and use it as a source of energy? – DanielKO Aug 19 '13 at 08:45
  • For "fun with theory"; on modern CPUs (e.g. Intel Nehalem and later, with "loop stream detectors"), I'd consider trying a loop containing single-byte `NOP` instructions (so the instructions get discarded by front-end and don't make it to micro-op buffer). I'm guessing you can probably exceed "100 instructions per cycle in theory" this way. – Brendan Feb 21 '19 at 12:52
  • @Brendan: no, Intel CPUs at least do run NOPs through the whole pipeline. They take a slot in the ROB, but zero in the RS (unfused domain: no execution unit needed). This is definitely true for SnB-family, but I haven't tested Nehalem. Discarding them before issue into the back-end might be practical, but it's not a very valuable optimization. Presumably not worth the trouble of having the first instruction after the NOPs start at a RIP that isn't the end of the previous instruction, with no jump. Also, performance counters for "instructions" would be wrong. (not a dealbreaker, though.) – Peter Cordes Feb 21 '19 at 23:28
4

In practice these days, the effective number of instructions executed per second depends primarily on memory latency, which is the main bottleneck on performance. Waiting for data is bad. Processors can alleviate this problem somewhat with techniques such as caching, pipelining and concurrency, but the issue remains and will only get worse over time.

Proper implementation can make a huge difference. You may want to check out this question about cache-friendly code.
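
As a small illustration of how much the access pattern alone can matter (a sketch added here, not taken from the linked question), the following sums the same matrix row-by-row and then column-by-column:

```cpp
// Sketch: the same arithmetic with two different memory access patterns.
// Row-major traversal walks memory sequentially (cache-friendly); column-major
// traversal jumps by a whole row each step (cache-unfriendly).
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;                 // 4096 x 4096 ints, about 64 MiB
    std::vector<int> m(n * n, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += m[i * n + j];                // sequential access
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += m[i * n + j];                // strided access
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("row-major:    %.1f ms\n", ms(t1 - t0).count());
    std::printf("column-major: %.1f ms\n", ms(t2 - t1).count());
    std::printf("checksum: %lld\n", sum);       // keep the loops from being optimized away
}
```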

Marc Claesen
2

Modern CPUs pipeline instruction processing, so there is no such constant.

You could, however, read the number of CPU ticks at the start of your algorithm and at the end. I think this is as low-level as you can get with such a measurement.

http://en.wikipedia.org/wiki/Time_Stamp_Counter

Note: there are a lot of reasons why this won't be 100% accurate. I can mention a few, but I am sure the community will be able to add to the list:
  • the OS pre-empting your process
  • cache misses (the algorithm will run slower the first time, faster if it's run subsequently)
  • on older CPUs, the CPU ticks are not invariant to the CPU frequency
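
A minimal sketch of that approach using the compiler's RDTSC intrinsic (x86 only, GCC/Clang; added here as an illustration, not a robust timer):

```cpp
// Sketch: bracket the code under test with RDTSC reads (x86 only; __rdtsc()
// comes from <x86intrin.h> on GCC/Clang). The raw tick count is only a rough
// measure; see the caveats listed above.
#include <x86intrin.h>
#include <cstdio>

int main() {
    unsigned long long start = __rdtsc();

    volatile long sum = 0;
    for (long i = 0; i < 1000000; ++i) sum += i;   // the code being measured

    unsigned long long end = __rdtsc();
    std::printf("elapsed: %llu CPU ticks\n", end - start);
}
```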

Patrik Beck
  • 2
    Unless (virtually) nothing else is running on the machine at all, this generally won't be very accurate at all. The time stamp counter is useful for very *short* sections of code that run in a single time slice. For something like a complete program, it usually makes more sense to get the time from the OS (e.g., `times` on Linux or `GetProcessTimes` on Windows). – Jerry Coffin Aug 19 '13 at 07:27
  • On unix systems, `clock_gettime()` is the preferred way, as you can specify how you want the time measured (it will even map to RDTSC if you really want it); C++11 more or less incorporated it into `std::chrono`. – DanielKO Aug 19 '13 at 07:53
2

You can use the perf tool on Linux. It is easy to use.

To get statistics on CPU cycles, instructions per cycle (IPC), cache hits/misses etc., simply run your program under perf. A sample command is:

perf stat -d <exename>

For more information, visit http://www.brendangregg.com/perf.html or https://perf.wiki.kernel.org/index.php/Tutorial

Chenna K