
I'm looking for code that will suffer a performance degradation when moved to a newer CPU. I know this is theoretically possible, but I'm having a hard time finding an example that actually works.

Some constraints:

  • It should be single threaded

  • It should be compiled for either i386 or the oldest x86_64 baseline, or be handwritten assembly

  • If compiled, it should statically link all libraries, so that libc can't load optimised versions of library routines at runtime

  • Clock cycles can be approximated as time of execution × max frequency, or some perf tool can be used. This is in order to avoid RISC-style code that would run blazingly fast on 4 GHz Pentium 4s.

My current idea is to overload the instruction issue buffer with branches, but I have no idea how to implement that effectively. Other approaches are welcome; the more ways to sink perf, the better.

BeeOnRope
vguberinic
    `if (new_cpu) while(1);` – Eugene Sh. Jul 11 '17 at 17:06
  • Ha! That is the worst abuse of CPUID I have seen! – vguberinic Jul 11 '17 at 17:12
  • You consider a P4 a "newer cpu"? – EOF Jul 11 '17 at 17:15
  • A more recent example is using vector shuffles, which had a throughput of 2 on Nehalem and the bridges, but it's back to 1 on the wells and lakes. – harold Jul 11 '17 at 17:22
  • 1
    By syncing memory modifications and clearing caches you would be probably able to slow down to the memory speed on any future architecture, and as the speed of memories is growing slower, it will feel as degradations in terms of CPU performance (but in absolute time it will still be faster in the future). And of course some idling loop based on `CPUID` returned value.... will probably render your SW unusable in few years. – Ped7g Jul 11 '17 at 17:35
  • Haswell has higher latency on lane-crossing AVX shuffles than Sandybridge/IvB. (3c vs. 2c). Also, Haswell only has one shuffle unit, while earlier Intel could run 2 vector shuffles per clock. – Peter Cordes Jul 11 '17 at 19:54
  • More generally, look at [Agner Fog's instruction tables](http://agner.org/optimize/) for something with higher latency in a newer CPU than an older CPU, and use it in a loop-carried dependency chain. Related: [Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs](https://stackoverflow.com/questions/37361145/deoptimizing-a-program-for-the-pipeline-in-intel-sandybridge-family-cpus) – Peter Cordes Jul 11 '17 at 19:57
  • The title seems clear, but the text isn't. You mean you want a program that will take more core clock cycles on a newer CPU, right? So 1 sec on a 4GHz Skylake is the same performance as 4 seconds on a 1GHz Pentium III, 4 gigacycles when both are running at their rated clock speeds. Also, the new/old pair can be anything, like IvyBridge -> Haswell, or 80286 -> Skylake? What about across vendors, like Intel Pentium II vs. AMD Bulldozer? Hrm, actually I can see why this got closed as too broad. There are a zillion possible answers. OTOH, I could give a generic answer for how to find cases. – Peter Cordes Jul 11 '17 at 21:54
  • Does the program have to do anything useful, or can it just run a synthetic loop that does a chain of dependent `BSF` instructions for 100M iterations and then exit?. (Big perf drop from Intel to recent AMD). – Peter Cordes Jul 11 '17 at 21:57
  • 1
    I think you could clarify your question by adding some ranges for what you mean by "new" and "old". For example, are you interested in today's architectures versus 5 years ago, 10 years, 20 years? It is clear enough to me from the title, but you should probably also clarify that you are interesting in measuring in _cycles_ and not in _time_ (and then your clarification on how to convert between _cycles_ and _time_ makes sense). – BeeOnRope Jul 12 '17 at 19:20

1 Answer


The Pentium 4 had a double pumped ALU, so pretty much any simple chain of dependent ALU ops will execute at two ops per cycle on a P4, but one op per cycle on all recent architectures.

For example:

top:
    or  eax, eax
    or  eax, eax
    or  eax, eax
    or  eax, eax
    ...
    sub ecx, 1
    jnz top

Beyond that, (much) older architectures had single-cycle memory access; later ones took a handful of cycles, while today a memory access costs hundreds of cycles. So anything bound by memory latency will often take fewer cycles on older architectures. The simplest example is a pointer-chasing loop.

Similarly for mispredicted branches: the shorter pipelines of older architectures meant a smaller misprediction penalty in cycles. The penalty probably peaked around the P4, then came down to around 15 cycles and has been relatively steady since.

BeeOnRope
  • Yeah, that ALU chain might be doubly fast on P4, but as soon as you hit one mispredicted branch (inevitable at least once), virtually all of that speed advantage will evaporate. The processor's pipeline is as long as the earth: 31 stages on Prescott. – Cody Gray - on strike Jul 12 '17 at 18:46
  • @CodyGray of course :). No one is arguing that the P4 is somehow generally better than current architectures, but the loop above can be made almost arbitrarily long (larger body and/or bigger tripcount), at which point the mispredict penalty for this loop will be almost arbitrarily small. I think the example is interesting because it's one which actually uses common operations and is faster both in cycles and in "real time". Many examples like memory access are faster on older stuff in cycles but only because frequencies were very low, so they are not faster when measured in time. – BeeOnRope Jul 12 '17 at 19:16
  • ... and I'm also curious why we have never seen a double-pumped ALU make a re-appearance. I don't think a long pipeline was a prerequisite for the double-pumped ALU; it was more orthogonal. Indeed, the fact that the _latency_ of dependent operations was 0.5 cycles shows that the ALU itself wasn't really pipelined in the traditional sense (I think there was some "width pipelining", where the first half of the result fed into the next operation in the first cycle, with the second half coming in the next cycle). I guess it's just too restrictive to design an ALU like that. – BeeOnRope Jul 12 '17 at 19:18
  • 1
    I'm sort of surprised we haven't seen it re-emerge, also. There were basically two ideas in P4/Netburst that were good: the double-pumped ALU and the µop cache. The latter made it into Sandy Bridge, but the former is still MIA. *My* guess would be that it's simply too expensive. The only reason they had a double-pumped ALU in Netburst was because they had no choice if they wanted real-world performance anywhere close to the same ballpark as the P6, especially in the earlier revisions. They aren't forced into doing it anymore, so they haven't. We've got more cores now to play with, I guess. – Cody Gray - on strike Jul 12 '17 at 19:26
  • 1
    Yeah, although the uop cache in SnB is quite different than the trace-cache in P4 (neither is strictly better than the other, but I think the trace-cache had more glass jaws). I don't remember if the 64-bit variants of the P4 were also double pumped with the same performance? Maybe the width-pipelining that was feasible for 32-bit ops (using 2 chained 16-bit ALUs) just didn't work for 64-bit operations (since effectively you need to propagate across 128 bits in a single cycle for something like `add`). Asked [over here](https://stackoverflow.com/q/45066299/149138). – BeeOnRope Jul 12 '17 at 19:46