4

Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:

# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null

Outputs:

1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps

...but the numbers returned are always system-dependent, so in a sense such a benchmark also measures the user's hardware.

Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system (or at least on a reasonably large subset of systems)? Just as grep -m1 bogo /proc/cpuinfo returns a rough but stable unit, such a benchmark should also return a similarly stable unit of duration.

Suppose that for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having any algorithm in common with BogoMIPS):

bogobench foo bar.data

And we run this on two physically separate systems:

  1. a 1997 Pentium II
  2. a 2015 Xeon

Desired output would be something like:

21 bogo-seconds

So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.


A hardware emulator like qemu might be one approach, but not necessarily the only approach:

  1. Insert the code to benchmark into a wrapper script bogo.sh
  2. Copy bogo.sh into a bootable Linux disk image bootimage.iso, in a location from which bogo.sh would autorun and then promptly shut down the emulator, outputting some form of timing data along the way to be parsed into bogo-seconds. (A rough sketch of such a wrapper follows this list.)
  3. Run bootimage.iso using one of qemu's more minimal -machine options:

    qemu-system-i386 -machine type=isapc bootimage.iso
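
Purely as an illustration (the autorun mechanism, the image layout, and the command foo bar.data are all hypothetical placeholders), bogo.sh might be as simple as:

    #!/bin/sh
    # bogo.sh -- hypothetical wrapper baked into bootimage.iso, run at guest boot.
    # Time the command under test, write the result to the guest's serial port
    # (so the host can capture it), then power the emulator off.
    /usr/bin/time -o /dev/ttyS0 foo bar.data
    poweroff -f

...and the host-side qemu command above could add something like -serial file:bogo-timing.txt so the serial output lands in a file for later parsing into bogo-seconds.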
    

But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)

Online we sometimes compare and contrast timing-based benchmarks made on machine X with ones made on machine Y. Instead, I'd like users X and Y to both be able to run their benchmarks on a virtual machine Z, with bonus points for emulating X or Y (as MAME does) if need be, but with no consideration of X's or Y's real run-time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.

Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, since the user can rarely be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.

agc
  • 2
    No. What are you trying to measure? Benchmarks are endlessly fungible; results across machines are seldom comparable. How many CPUs does your test assume? How much memory is on the machine? How much of that memory does the test use? How much disk is on the machine? How fast is it? How solid-state is it? What else is going on while the test is running? Etc, etc, etc. The question is, I'm afraid, too broad to be a good fit on SO. – Jonathan Leffler Nov 04 '16 at 21:14
  • @JonathanLeffler, the Q is about measuring the (virtual) time needed to run a single-threaded util -- the host hardware should not be relevant, we settle for some arbitrary standard, and go with that. What those minimal standards might be is beside the point. (For the moment please assume memory and disk are not relevant.) Think of it like [*MAME*](https://en.wikipedia.org/wiki/MAME), where an '80s video game will run at the same real-time speed on most any post-2000 hardware -- except here, (unlike *MAME*), we don't even care about *real-time*. – agc Nov 04 '16 at 21:46
  • Sorry, I don't see the point. If you want to measure the effort of `seq` to count to a million then run an `strace` (or similar utility) and count the output lines? This way you have factored out the execution time. If a compiler can make it run with fewer steps with different options (or compiler) OR hardware, then you have measured a difference unrelated to CPU speed etc. And then how can you meaningfully use that info? Getting your system to run as fast as possible given hardware and environment constraints is enough of a challenge ;-) (IMHO). Sorry, agree w/ JL above, too broad, but good luck! – shellter Nov 04 '16 at 22:11
  • @shellter, the point is to obtain an execution time, albeit a more invariant one. Re *"enough of a challenge"*: such an invariant benchmark utility could help with that. By `seq`, please read `foobar` (i.e. any generic util); by specifying a less I/O-bound util, I'd vainly expected to curtail non-relevant considerations of I/O. – agc Nov 04 '16 at 23:38
  • 1
    Ah, [standards](https://xkcd.com/927/). – Elliott Frisch Nov 04 '16 at 23:55
  • Yes, I understood that it was really `foo`, but as you mentioned `seq` it seemed more appropriate to keep things tied to a real example. Can you explain (in your Q, preferably) what you will use this invariant benchmark for once you have it? Are you looking to create a system model that will 'calculus' together all these values for a particular design (or something like that)? Good luck. – shellter Nov 05 '16 at 03:15
  • 1
    @shellter, I've tweaked the OP a bit. The value of such a util would be so general that it's like asking for a specific use of `bc`. An unconventional use might be called *inverse benchmarking*, where *system Y* could attempt to calculate a reasonably good estimate of how long (in *real time*) a util would take to run on *system X*. – agc Nov 05 '16 at 04:32
  • It sounds like you want a machine simulator that will do a cycle-accurate simulation of a standard reference machine, so you can run different benchmarks *on it*. Your question title is asking for a benchmark, not a platform to run benchmarks on. – Peter Cordes Nov 06 '16 at 14:12
  • Anyway, I don't think this is as useful as you think it is. When we care about the speed of a real program, we care about it running on real hardware. Unless your simulator's CPU-specific quirks are approximately similar to some real hardware, function A could be faster than function B on the simulator, but slower than B on a real i5 4670 (Haswell with 6MB of L3 cache) or something. – Peter Cordes Nov 06 '16 at 14:17
  • As discussed in [answers and comments on the Collatz conjecture asm question](http://stackoverflow.com/q/40354978/224132), microarchitectural differences can matter, even for pure CPU-bound code (no memory access), between Intel Core2 (Merom), AMD Bulldozer, and Intel Haswell. This matters when it comes to tradeoffs like more memory access vs. more CPU computation. Probably the best approach is what people do now: time two versions of code on the same hardware, and say "**my changes give a 25% speedup vs. your code, on my hardware**". Good way to find perf problems on specific hardware, too. – Peter Cordes Nov 06 '16 at 14:22
  • Still, voted to reopen, now that I think it's clear what you're asking. But please consider editing the title. I didn't myself in case I'm misunderstanding what you're asking, or not understanding the use-case you have in mind. – Peter Cordes Nov 06 '16 at 14:23
  • 1
    @PeterCordes, thanks for the feedback and also for the vote -- I'd like a reasonably approximate but not necessarily perfect non-subjective command-line benchmark util. _How_ it might work is secondary. The *emulator/simulator* idea is one approach, but it seems premature to suppose it's the _only_ possible way -- I mention *emulators* more as a "*brute force*" method to show it's not technically impossible. Agreed that *simulation* can be imperfect, yet that doesn't make it altogether useless or unimprovable. – agc Nov 06 '16 at 19:10
  • 1
    Compiler optimization is not relevant, as it pertains to the wrong level of abstraction -- to eliminate this level just suppose `foo` consists of incrementing a register *100000000* times, i.e. suppose `foo` were provably immune to compiler optimization. So `bogobench` would measure that level of code. The fact that *any* code might be optimized, (whether by hand or compiler -- which is ultimately done by hand, since humans write compilers), is not relevant. – agc Nov 06 '16 at 19:16
  • Is that last comment intended for @shellter? – Peter Cordes Nov 06 '16 at 19:18
  • @PeterCordes, well both you and he, yes... – agc Nov 06 '16 at 19:21
  • 1
    Ok, I still have no idea what kind of thing you're hoping to measure with this. Do you want this benchmarking system to take x86 binary executables as input, or does it take C or C++ programs as input? If your benchmark system doesn't try to simulate execution times on any real hardware, it becomes a new platform that compilers could optimize for. There's no guarantee that tuning your source code to do well on your system is better than tuning it to do well on whatever real hardware you happen to have, and many reasons to suspect it will be less relevant if you ultimately care about real HW – Peter Cordes Nov 06 '16 at 19:25
  • I have tried, but literally can't imagine any case where I'd want to use anything like what you described, unless it was at least close to an accurate simulator for some real hardware. Can you suggest some use-cases, maybe with an edit to the question? – Peter Cordes Nov 06 '16 at 19:27
  • Since you linked BogoMIPS, you realize that BogoMIPS measure only the performance of that delay loop, and isn't very strongly correlated with anything else, right? e.g. an Intel P4 and an Intel Skylake both have bogomips ~= clock * 2.0, but when running typical code (other than Linux's delay loop), Skylake averages more than twice the instructions per cycle, IIRC. – Peter Cordes Nov 06 '16 at 19:38
  • 1
    @PeterCordes, sure *BogoMIPS* are **bog**_us_, yet it's still better than nothing, and such a benchmark should be used in the same spirit of skeptical compromise. – agc Nov 06 '16 at 19:47
  • "*yet it's still better than nothing*". Better for what, and in what case? I think any program that wanted an approximation of how fast the current system is would do much better by including its own micro-benchmark that timed some function it actually cares about. BogoMIPS these days is not much if any better than the clock speed, now that even x86 CPUs are pipelined. – Peter Cordes Nov 06 '16 at 20:15
  • 1
    You realize BogoMIPS is a "benchmark" itself, but you're talking about a system for benchmarking arbitrary code, aren't you? I don't really see the connection there, either. If you want to get a rough guess of how fast something will run on another system, scaling the results of your system by the BogoMIPS ratio between those systems is not even close to good. If you were going to pick a single number as a scale factor, some kind of integer or FP microbenchmark would make much more sense than BogoMIPS. e.g. SPECint2006 ratios would not be a terrible choice. – Peter Cordes Nov 06 '16 at 20:18
  • 1
    @PeterCordes, fair enough, those are all good points. I'd mostly meant the "*bogo*" prefix as an adjective indicating *vague approximation*, (via some unknown benchmark), rather than a noun signifying adoption of BogoMIPS's time loop algorithm. I'll try editing that confusion out... – agc Nov 07 '16 at 00:37
  • re: your last sentence: *Whereas a more virtual benchmark ought to be less susceptible to such influences.* That's the problem: it can give similar (or identical) results across different hardware, but only by having everyone experience the same potentially-huge quirks, not by removing them. – Peter Cordes Nov 07 '16 at 05:26
  • @PeterCordes, sure there'd be quirks with a virtual machine method, but they'd be more or less _known_ quirks. Whereas even one's own physical hardware can be mysterious, let alone what's running on it. A set of known quirks is generally more manageable than a set of unknown ones. – agc Nov 07 '16 at 06:31

2 Answers

4

The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.

AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex: despite a lot being known about x86 microarchitecture internals (Agner Fog's work, Intel's and AMD's own optimization guides, and other material in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade secrets that at best something similar can be simulated. (Branch prediction, for example, is one of the most secret but most important parts.)

While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.


But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.

From that link:

Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.

So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.


For [fastest-code] challenges over on the Programming puzzles and Code Golf site, @orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.

There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.

Peter Cordes
  • To be fair, you don't really need a cycle-accurate simulator for modern x86 hardware to satisfy at least some uses for the OP's bogobench. You just need a simulator that covers _enough_ of the behavior of the type of hardware you are trying to target to be useful. For example, just fix your width and instruction latencies at some reasonable values and use a reasonable model of caching. Assuming the idea is that this is a long-term stable benchmarking tool, you don't even really _want_ to encode too many of the current quirks since they change over time. – BeeOnRope Nov 17 '16 at 17:19
  • 1
    For example, if you had programmed in the quirks of instruction pairing of the early Pentiums, it would not be very useful today. The trend in x86 seems in many ways to be towards more general scheduling, more standard latencies, fewer crazy stalls, and so on. So a fairly simple model may be as close to, or closer to, a machine 10 years from now than a cycle-accurate simulation of Haswell would be. – BeeOnRope Nov 17 '16 at 17:22
  • @BeeOnRope: Yes, close-to-accurate would be good enough, and I agree that the trend of fewer quirks is likely to continue. My point was that tuning for something other than current hardware doesn't necessarily help you run fast on current real hardware with current real compilers that tune for it. Otherwise sure, just pick anything, like MMIX. Having it be somewhat close to hardware you care about is useful, though, especially if you can model cache behaviour at all reasonably, and ideally even simulate a multi-core machine with realistic synchronization costs. – Peter Cordes Nov 17 '16 at 19:42
  • 2
    I can see a lot of really awesome uses for this, beyond tuning. For example, you currently can't really write continuous integration tests that test for performance regressions because you don't have a stable deterministic benchmark function: `long time = f(x86 binary)` that spits out the same number every time. _Actually timing_ the code is not going to be stable enough for something that is pass/fail, and isn't consistent across machines (or even run-to-run on one machine), but this bogobench would give the same result each time. Everyone in the world can get the same results. – BeeOnRope Nov 17 '16 at 19:48
  • 1
    Think about `IACA` - that's already pretty useful despite being very limited (in particular, it doesn't run the code so it's only useful for basic blocks). This is like IACA on steroids. You can also imagine it being useful for "cost" functions for stuff like STOKE or generally anything that needs a deterministic performance number. Of course, it's never going to replace real-world testing for tuning purposes. – BeeOnRope Nov 17 '16 at 19:49
  • @BeeOnRope: catching performance regressions is actually a really good point. That's the first suggestion I've seen for a real use-case for this that doesn't require much accuracy to be useful. It wouldn't have to be very accurate to catch changes that accidentally broke auto-vectorization of important loops. – Peter Cordes Nov 17 '16 at 19:55
3

One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware-invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.

Unfortunately, most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code takes the same code paths every time it is run, the instructions-retired count should be the same on all hardware¹.

Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.

The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.

Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
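
For example, a minimal sketch of that one-liner (assuming Linux perf is available; foo bar.data stands in for the hypothetical util from the question, and the CSV field handling may need adjusting for your perf version):

    # Count retired instructions (perf's CSV output goes to stderr; the program's
    # own stdout is discarded), then convert to bogo-seconds at a nominal 1 GHz, CPI = 1.0.
    insns=$(perf stat -x, -e instructions -- foo bar.data 2>&1 >/dev/null |
            awk -F, '/instructions/ {print $1}')
    echo "$(echo "scale=2; $insns / 10^9" | bc) bogo-seconds"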

To patch the holes, then, you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache misses in a hardware-invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache-miss and branch-miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
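
For instance (a sketch only: the cache geometry, the latency weights, and foo bar.data are made-up illustrations, though the cachegrind options themselves are real):

    # Simulate a fixed, machine-independent cache hierarchy plus branch prediction,
    # so the counts come out the same regardless of which host runs the test.
    valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
             --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 \
             --cachegrind-out-file=cg.bogo.out ./foo bar.data

    # The reported instruction count (Ir), last-level misses, and branch mispredicts
    # could then feed a weighted formula with nominal costs, e.g.:
    #   bogo_cycles  = Ir + 200 * LL_misses + 15 * branch_mispredicts
    #   bogo_seconds = bogo_cycles / 1e9        (nominal 1 GHz)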

That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86² emulators and add your simple machine model right in there. You don't need to be cycle-accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is doing might be a good match, and you get the cache and branch prediction modeling already for free.


¹ Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.

² You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

agc
BeeOnRope