
In the context of preparing some presentation, it occurred to me that I don't know what the theoretical limit is for the number of integer operations a Haswell core can perform at once.

I used to naively assume "Intel cores have HT, but that's probably parallelizing different kinds of work, so a core probably maxes out its parallelism with 256-bit AVX operations, i.e. 8 integer ops issued per clock cycle (and, assuming nice pipelining, 8 completing as well)" - so 8 ops/cycle.

But then I noticed this article, which tells me Haswells (and Sandy Bridges) have 3 dispatch ports which can feed vector units. So is the true figure 24 integer ops/cycle?

PS - I realize that in practice you might need to actually read all of that data from memory and its bandwidth would be the limiting factor. Or it'll be QPI that's too slow.

einpoklum
  • Wikipedia http://en.wikipedia.org/wiki/Haswell_(microarchitecture) says they can have up to 8+ physical cores, and "eight execution ports per core", with a "14- to 19-stage instruction pipeline". Caching makes obtaining real performance values difficult, which is why benchmarking software randomizes the cache. HT does NOT give a 100% performance gain (two threads into one core), because each core shares most of its hardware between two threads. I'd say write an assembler test program, benchmark it, then report real-world results. http://masm32.com/board/ has several good articles about this. – rdtsc May 17 '15 at 19:51
  • @rdtsc: I was talking about a single core - and about the theoretical limit, not what you can achieve in practice. – einpoklum May 17 '15 at 21:05
  • At the risk of sounding pedantic, what is the value of theoretical limits when they are unattainable? – rdtsc May 18 '15 at 00:33
  • @rdtsc: The fact that a limit is unattainable due to external factors (e.g. memory bandwidth) does not make it meaningless or useless. It provides a context for evaluating the attainable values, and suggests the importance of removing obstacles to attaining it. – einpoklum May 18 '15 at 05:51
  • In x86 architecture, "obstacles" aren't removed. Speeds increased, sure, but backwards-compatibility is paramount. The x86 can still run code from 30 years ago. Instead the motto is "add a new feature." It is an amalgamation of "features" tacked on top of each other for decades. So petitioning for another port or more cache may increase the theoretical limits, but the net result will always be *more complexity and code-and-cache dependence.* – rdtsc May 18 '15 at 11:43

1 Answer


The theoretical maximum is 25 32-bit integer ops per cycle:

  • Port 0: 1 scalar op or 1 vector shift-by-constant or bitwise boolean op
  • Port 1: 1 scalar op or 1 vector add/sub/min/max or cmp or bitwise boolean op
  • Port 5: 1 scalar op or 1 vector add/sub/min/max or cmp or bitwise boolean op
  • Port 6: 1 scalar op (or 2, if you count SWAR with a 64-bit integer register).

Since each vector op performs eight 32-bit operations, the maximum is 25 integer operations per cycle: 8 each for ports 0, 1, and 5, plus 1 for port 6. Or 26 when SIMD-within-a-register (SWAR) on p6 is viable. (See Paul Clayton's comment.)
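The per-port tally above can be checked with a quick sketch (just arithmetic over the port capabilities listed; the port names and lane count come from the answer):

```python
# Theoretical peak 32-bit integer ops per cycle on one Haswell core,
# tallied from the per-port capabilities listed above.
LANES = 256 // 32  # a 256-bit AVX2 op holds eight 32-bit lanes

ports = {
    "p0": LANES,  # one vector shift-by-constant or bitwise boolean op
    "p1": LANES,  # one vector add/sub/min/max, cmp, or boolean op
    "p5": LANES,  # one vector add/sub/min/max, cmp, or boolean op
    "p6": 1,      # one scalar op
}

peak = sum(ports.values())
print(peak)      # 3*8 + 1 = 25
print(peak + 1)  # 26 if the p6 op is a 2x32-bit SWAR trick
```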

If we're just talking about "normal" integer stuff (add/multiply/bitwise/shift), then we have to exclude 32-bit multiplies (other than by power-of-2 constants) to achieve 25 ops per clock. Real integer code will often be able to keep p0 busy with multiplies, PSADBW, shifts, and booleans, and will almost always have a significant amount of shuffling (p5). We're artificially excluding things that don't sustain eight 32-bit ops per clock, like multiplies, variable-count shifts, and data movement between integer and vector registers (MOVD / MOVQ).

Vector multiplies run on p0, but VPMULLD (eight 32x32 -> 32b multiplies) only runs at one per 2 cycles, since it takes 2 dependent uops (10c latency). See http://agner.org/optimize/ for instruction uop/port/throughput/latency tables.
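To see how much a multiply-heavy loop gives up, this small sketch compares VPMULLD's effective rate (one result vector per 2 cycles, per the uop counts above) against a single-uop p0 op such as an immediate-count shift:

```python
LANES = 8            # 32-bit lanes in a 256-bit vector
VPMULLD_CYCLES = 2   # 2 dependent uops -> one result per 2 cycles

# VPMULLD: eight 32x32 -> 32-bit multiplies, but only every other cycle
mul_per_cycle = LANES / VPMULLD_CYCLES
print(mul_per_cycle)    # 4.0 multiplies per clock

# A single-uop p0 op (e.g. VPSLLD by an immediate) sustains all 8 lanes
shift_per_cycle = LANES / 1
print(shift_per_cycle)  # 8.0 ops per clock
```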

Sustaining this throughput in the frontend requires the loop buffer, so keep the loop smaller than 28 uops (or 56 without hyperthreading). That count includes the compare-and-branch loop overhead, so the theoretical throughput is actually slightly below 25. The macro-fused compare-and-branch runs on p6, though, so it only displaces every 7th scalar op, making the sustainable throughput something like 24.85 ops per clock. (Or 25.85 with SWAR.)
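The 24.85 figure falls out of the loop-overhead arithmetic: in a 7-cycle steady state, p6 spends one cycle on the macro-fused compare-and-branch and six on scalar ops, while p0/p1/p5 each retire a full vector every cycle. A quick check, under those assumptions:

```python
CYCLES = 7       # period of the compare-and-branch on p6
VEC_PORTS = 3    # p0, p1, p5
LANES = 8        # 32-bit lanes per 256-bit vector op

vector_ops = VEC_PORTS * LANES * CYCLES  # 168 ops over 7 cycles
scalar_ops = CYCLES - 1                  # p6 loses one slot to the branch
per_cycle = (vector_ops + scalar_ops) / CYCLES
print(round(per_cycle, 2))  # 24.86, i.e. "something like 24.85" above
```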

Another source describing Haswell's microarchitecture.

Peter Cordes
Craig S. Anderson
  • Ok, fair enough, but - can the pipeline theoretically sustain this? e.g. decode 4 instructions per cycle for feeding these ports? – einpoklum May 18 '15 at 05:52
  • According to Agner Fog - yes. See http://www.agner.org/optimize/blog/read.php?i=285#285 – Craig S. Anderson May 18 '15 at 06:21
  • Also, do you have any idea why the first port can do vector multiply but not vector add? – einpoklum May 18 '15 at 07:37
  • @einpoklum, [one can only speculate as to why there is only one port for add](https://stackoverflow.com/questions/28861416/for-xmm-ymm-fp-operation-on-intel-haswell-can-fma-be-used-in-place-of-add/28873673#28873673). Mysticial wrote "I think the FPU on port-0 for Haswell can only handle 5-cycle instructions. It doesn't have "early-out" logic that lets it handle both 3 and 5-cycle instructions. FP-add is a 3-cycle instruction, therefore it can't go into port-0." – Z boson May 18 '15 at 07:51
  • @einpoklum, and in the same blog post from Agner Fog linked in this answer (as well as mine), someone wrote "Given the 3-cycle latency of FP Add and the 5-cycle latency of both FP Multiply and FP Fused-Multiply-Add, it seems reasonable to speculate that Intel only wanted to add the extra complexity of an "early out" mechanism on one execution port (Port 1). With no need to change the latency, it is trivial to support an isolated Multiply on either Multiply-Add pipeline. Also note that the other FP execution port (Port 0) is already burdened with the logic for FP divides, which is fairly extensive." – Z boson May 18 '15 at 07:51
  • @einpoklum, so likely the reason there are two ports for multiply and only one for add is that it was in some sense free to implement. Even if it's less useful in practice than two ports for add, it's still better than only one port for both. – Z boson May 18 '15 at 07:53
  • @Zboson: You're talking about **FP**. This question is about Integer: Haswell can run `VPADDD` (and other vector-int add/sub/min/max) on p1/p5, but `PMULDQ` (and other vector-int multiply and multiply-add) only on p0. Integer vector multiply isn't in as much demand, and presumably duplicating that functional unit just wasn't worth the cost in transistors. They also like to keep ops of the same latency on the same ports. – Peter Cordes Jan 20 '16 at 08:43
  • @PeterCordes, thanks for the info. I feel a little silly now. I really wish Intel would implement fast 64-bit*64-bit to 128-bit vector multiplication. This is the main problem with multi-word SIMD operations (add with carry is less of an issue). – Z boson Jan 20 '16 at 08:52
  • The number is correct, but not if you restrict yourself to add/multiply. `VPMULDQ` only does four 32*32->64b multiplies. `VPMULLD`, which does eight 32*32->32b multiplies, takes 2 (dependent) uops, so it runs with one per 2c throughput (10c latency). There are other integer ops which can run on port0, though: `VPSLLD` for example is a logical left-shift. The immediate-count form runs with one per 1c throughput. Bitwise boolean stuff also runs on all 3 vector ports. Oh also: @Zboson: interestingly, the scalar integer multiply unit is on p1, so vector and scalar mul can go together. – Peter Cordes Jan 20 '16 at 09:01
  • [Technically](http://xkcd.com/1475/), the scalar-only port can perform *two* 32-bit operations as long as there is no destructive crossing of the 32-bit boundary. (A crossing can be non-destructive if the change can be ignored, such as with a right shift and mask. A crossing could even be beneficial, as in the case of finding the sum of a vector of 32-bit integers where the internal carry-out/in is handled naturally.) Bitwise logical operations naturally fit this requirement, but shift, add, and subtract can also fit given appropriately constrained inputs. –  Jan 20 '16 at 15:39
  • Oops! The carry-out/in comment is wrong! The carry-in would convert adding 2**32 to adding 1. (For traditional TCP checksum calculation such does not matter, but for that one would not be talking about 32-bit operations.) –  Jan 21 '16 at 11:25