Was there a P4 model with double-pumped 64-bit operations?

Question

I recall that one of the interesting features of the initial P4 micro-architecture was its double-pumped ALU. I think Intel called it something like the Rapid Execution Unit, but basically it meant that each execution unit in the ALU was effectively running at twice the frequency, and could handle two simple ALU operations in a single cycle, even if they were dependent.

This feature disappeared at some point (before or at the same time as the P4 itself), but was there ever a 64-bit P4 with a double-pumped ALU? The 64-bit variants of the P4 came out in 2004, about four years after the initial 32-bit release, but it isn't clear to me whether the double-speed ALU had disappeared by then. It seems like the width-pipelined approach used to double the speed would be difficult for 64-bit, which is what piqued my curiosity.

Since one may still need to support some (evidently quite old) 64-bit P4 hardware, knowing the ALU behavior is interesting for optimization.

I'm 99% sure that all Netburst-derived processors (so all Pentium 4s) used double-pumped ALUs, and that included the later revisions (Prescott, Cedar Mill) that implemented EM64T. I have one here that I could fire up and benchmark, if this doesn't get closed before I get a chance. :-) — Cody Gray - on strike, Jul 12 '17 at 19:48
I found some [semi-confirmation here](http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html#No%20double%20frequency%20building%20blocks%20used%20yet), under _ALU Latencies_ for Prescott (in the table). — BeeOnRope, Jul 12 '17 at 19:50
Agner Fog's tables say `add r,r` is 0.5c latency on Prescott. I expect he tested all 4 operand-sizes. He lists `imul r64,r64` as 1 uop for port 1 with 2.5c throughput. But this AIDA64 InstlatX64 result for a [Pentium 4 640 Prescott-2M](http://users.atw.hu/instlatx64/GenuineIntel0000F43_P4_Prescott_InstLatX64.txt) shows 1c latency for `add` and 2.0c latency for `imul r64,r64`. So maybe Intel did drop the double-pumped ALUs at some point. I would have thought that would be a hard thing to change, but not impossible. — Peter Cordes, Jul 13 '17 at 01:59
Oops, 2.0c *throughput* for `imul r64,r64` on a Prescott, vs. 2.5 on the Prescott Agner Fog tested. They agree on latency=10c. Agner's Prescott results don't match at all with InstLatX86 results, or with that article you found saying that Prescott had ditched the double-frequency ALUs in favour of parallel ALUs that could run 2 uops per port per cycle, but only if they're independent. — Peter Cordes, Jul 13 '17 at 02:25
For your own P4 hardware, it should be easy enough to test, right? Put `%rep 16` `add eax, eax` `%endrep` inside a loop and use perf counters (hmm, I guess that's the trick; Does Linux `perf` even handle P4? Or do you need `oprofile`). It will run at about 1.0 or 2.0 IPC, depending on `add` latency being 1 or 0.5c. You can test throughput by having some ILP. — Peter Cordes, Jul 13 '17 at 02:33
@MargaretBloom: large enough to dominate any loop overhead, or any weird trace-cache effects or any bottlenecks from number of unresolved branches (roll-back targets) in flight. `%rep 2` or 3 would probably be fine, though. — Peter Cordes, Jul 13 '17 at 12:38
The differing latency counts *may* be explained by whether you're running the CPU in long mode or not when doing the benchmark. I'm researching a theory that Prescott introduced 32-bit ALUs that would work analogously to Willamette/Northwood's 16-bit ALUs. This turns out to be a *very* interesting and highly disputed topic, with surprisingly little authoritative information readily available online, although there was quite a bit of discussion about it on technical forums back in the day. Still working on putting together a complete answer, and then confirming with tests on real hardware. — Cody Gray - on strike, Jul 13 '17 at 12:45
@PeterCordes - yeah it would be easy to test, but I don't have accessible P4 hardware at the moment (even though it may still be an optimization target). I'll clarify the question a bit. — BeeOnRope, Jul 13 '17 at 18:00
@CodyGray: http://users.atw.hu/instlatx64/ has 32-bit (instlatx86) and 64-bit (instlatx64) results for the same CPU, in some cases (including that P4 640 Prescott-2M: [32-bit](http://users.atw.hu/instlatx64/) and [64-bit](http://users.atw.hu/instlatx64/GenuineIntel0000F43_P4_Prescott_InstLatX64.txt)). Presumably the 32-bit one is run in Compatibility mode under a 64-bit OS, but that should be ok. Core2 only macro-fuses cmp/jcc in 32-bit mode, but that does include compat mode. Anyway, `add r32,r32` numbers are the same in both. — Peter Cordes, Jul 13 '17 at 18:11
And instlatx86 does measure a [P4 Northwood](http://users.atw.hu/instlatx64/GenuineIntel0000F27_P4_Nortwood_InstLatX86.txt) `add r32,r32` at lat=0.5c, tput=0.35c, so the 1.0c on Prescott is probably not a measurement error. — Peter Cordes, Jul 13 '17 at 18:13
Okay, confirmed. The cycle counts reported elsewhere are accurate. `add r32, r32` takes ~0.5 clock cycles on P4 Northwood, but ~1.0 cycles on P4 Prescott. It changes nothing when running in 32-bit or 64-bit mode. In fact, it's quite curious and impressive that `add r64, r64` runs at exactly the same number of clock cycles as `add r32, r32` on Prescott. Problem is, this messes up my initial assumptions and what I find from Intel's technical papers, because this suggests that Prescott's ALUs are *not* double-pumped. — Cody Gray - on strike, Jul 14 '17 at 09:55
Hmm…or maybe that means that they *are* still double-pumped, but the results for `add r32, r32` are being artificially delayed. That is, a 32-bit result is ready on the first half-clock cycle, but it takes 2 half-clock cycles for a 64-bit result to be ready, so the processor delays the 32-bit result until the second half-clock cycle, even when running in 32-bit mode. I don't know how I would verify that. But if you think about it, that kind of makes sense when you consider 16-bit vs 32-bit throughput on Northwood. 16-bit adds *should* be done faster (half clock), but the numbers don't change. — Cody Gray - on strike, Jul 14 '17 at 10:15
That would mean that simply adding EM64T support to the core slowed *all* integer operations down, even when running in 32-bit mode (and even on Prescott editions that don't support EM64T; yes, I tested one of those, too). That pretty much sucks. Virtually no one was benefitting from 64-bit mode on these chips, yet everyone was paying the price for its inclusion. — Cody Gray - on strike, Jul 14 '17 at 10:17
@CodyGray - huh, that would be quite the change: unless it was counter-acted by some other architectural improvements I imagine it would have had an IPC impact on various benchmarks. — BeeOnRope, Jul 14 '17 at 22:12
Quite an interesting [thread](http://www.realworldtech.com/forum/?threadid=54405&curpostid=54405) on the RWT forums which covers this topic. — BeeOnRope, Jul 16 '17 at 17:47

Hadi Brais · Accepted Answer · 2018-08-27T21:35:02.123

I found the Intel Optimization Manual 2005 that covers both 32-bit and 64-bit NetBurst processors. Refer to Table C-8 on page C-17. According to the first comment on this blog post, the 32-bit Northwood's model is 02h and the 64-bit Nocona's model is 03h. The table shows that ADD/SUB/AND/OR/XOR have a throughput of 0.5 cycles on both processors, but a latency of 0.5 cycles on Northwood and 1 cycle on Nocona. This means that double-pumping is supported on Nocona, but only if the back-to-back instructions are not dependent. The rest of the table also shows that some instructions that were not double-pumped on Northwood were double-pumped on Nocona.
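To make the latency/throughput distinction concrete, here is a minimal sketch of an idealized single-port timing model. The function names are hypothetical, times are counted in half cycles to avoid fractions, and front-end, scheduler, and load effects are ignored, so this is only an illustration of the table's numbers and not a description of the real pipeline.

```c
#include <stddef.h>

/* All times are in half cycles ("fast" clocks) of the main clock.
 * Table values assumed from the text above:
 *   Northwood: latency 0.5c -> 1 half cycle, throughput 0.5c -> 1 half cycle
 *   Nocona:    latency 1c   -> 2 half cycles, throughput 0.5c -> 1 half cycle */

/* n back-to-back dependent operations: each one must wait for the
 * previous result, so only latency matters. */
unsigned long dependent_chain_half_cycles(unsigned long n,
                                          unsigned long latency_half)
{
    return n * latency_half;
}

/* n independent operations on one port: a new one issues every
 * tput_half, and the last one completes latency_half after it issues. */
unsigned long independent_half_cycles(unsigned long n,
                                      unsigned long tput_half,
                                      unsigned long latency_half)
{
    if (n == 0) {
        return 0;
    }
    return (n - 1) * tput_half + latency_half;
}
```

Under this model, a chain of 100 dependent `add`s takes 100 half cycles with Northwood's numbers but 200 with Nocona's, while 100 independent `add`s take roughly 100 half cycles on either, which is the difference the table describes.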

Summary: There is ample evidence that shows that some NetBurst-based processors (whether released or canceled) could perform at least 2 64-bit ALU operations per cycle using either 2 32-bit staggered ALUs or at least a single 64-bit staggered ALU (which would be enabled by smaller feature sizes such as 90nm at that time).

Figure 7 of the original paper¹ on the Intel Pentium 4 Willamette² processor discusses how the double-pumped³ ALU works in some detail (at the logic design level).

The figure shows a single 32-bit staggered ALU unit. This confirms that the ALU can perform two fully dependent (both input operands are dependent) simple ALU operations in three fast cycles (where a fast cycle is one half of the main clock cycle). The result of the operation itself is available after 2 fast cycles (1 main cycle), but the new flags are only available after the third fast cycle (1.5 main cycles). Note that there are two such ALUs, on ports 0 and 1, and both are staggered. So the design could execute two ALU dependency chains with a combined throughput of 4 operations per slow cycle.
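As a rough functional illustration of the staggering, the sketch below splits a 32-bit addition into the three fast-cycle steps described above. It assumes the halves are 16 bits wide (as the comments suggest for Willamette/Northwood); the type and function names are hypothetical, and the code models only the data flow, not the actual circuit.

```c
#include <stdint.h>

typedef struct {
    uint32_t sum;     /* full 32-bit result, complete after fast cycle 2 */
    unsigned carry16; /* carry out of the low half, produced in fast cycle 1 */
    unsigned carry32; /* carry out of the high half, produced in fast cycle 2 */
    unsigned zero;    /* zero flag, modelled as ready in fast cycle 3 */
} staggered_result;

staggered_result staggered_add32(uint32_t a, uint32_t b)
{
    staggered_result r;

    /* Fast cycle 1: add the low 16 bits and latch the carry. */
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);
    r.carry16 = (lo >> 16) & 1u;

    /* Fast cycle 2: add the high 16 bits plus the latched carry. */
    uint32_t hi = (a >> 16) + (b >> 16) + r.carry16;
    r.carry32 = (hi >> 16) & 1u;
    r.sum = ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);

    /* Fast cycle 3: flags are derived from the completed result. */
    r.zero = (r.sum == 0u);

    return r;
}
```

The point of the split is that a dependent operation needs only the low half of its input to begin its own fast cycle 1, so it can start one fast cycle after its producer instead of waiting for the full result.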

That paper was published in 2001. Intel published another paper⁴ in 2005 that discusses, in great detail and at the circuit level, how the staggered integer core in the Intel Pentium 4 Prescott⁵ processor works. It's not clear to me whether the paper discusses the 64-bit version of Prescott or the 32-bit version. However, this paper clearly states that the staggered ALU units can only perform additions, Boolean operations, shifts, and rotations (the other paper discussed the design of pre-Prescott cores in which the two fast ALU units did not support shifting and rotating). The other important difference is this statement from the paper:

There are two distinct 32-bit FCLK execution data paths staggered by one clock to implement 64-bit operations.

So it seems that the two fast ALU units on ports 0 and 1 are staggered together, enabling 64-bit fast integer operations such as additions. Therefore, the design could execute either two 32-bit ALU dependency chains with a throughput of 4 operations per slow cycle or one 64-bit ALU dependency chain with a throughput of 2 operations per slow cycle. This is even more powerful than a single staggered 64-bit ALU that can do only 64-bit operations, not 32-bit ones. This is most probably the design used in the 64-bit variants of the NetBurst microarchitecture.

Another⁶ paper⁷ from Intel confirms that Intel was indeed able to design a double-pumped 64-bit ALU. I quote from the paper:

In this paper, we describe a single-cycle integer ALU fabricated in 90nm dual-Vt CMOS technology operating at 4GHz in the 64b mode, with a 32b mode latency of 7GHz (measured at 1.3V, 25°C).

The paper doesn't mention whether this design has actually been used in any particular processor. But considering that the paper was published in 2004, there is a good chance that all of the 64-bit NetBurst cores (whether released or canceled) used the design.

Many 64-bit NetBurst-based processors have been released by Intel. For example, see this list for the server-grade processors. One of the cores is called Nocona. There is some experimental evidence that the design mentioned earlier (2 staggered 32-bit ALUs) was actually used in Nocona. Refer to these slides used in a course on code optimization taught at CMU in 2008. The slides compare the performance of Nocona (64-bit NetBurst), Intel Core (also 64-bit), and AMD Opteron (also 64-bit and apparently implementing the same 64-bit staggered ALU design). This is the code used in a loop:

x = x + d[i];

where all elements are 32-bit integers (unfortunately, 64-bit integers were not used).
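For context, a loop of this shape might look like the following minimal sketch. The function name `sum_i32` and its signature are assumptions for illustration; the slides' actual benchmark harness is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums n 32-bit integers. Every iteration performs one load (d[i]) and
 * one add that depends on the previous value of x, so the adds form a
 * single dependency chain. The caller must ensure the total fits in
 * int32_t, since signed overflow is undefined behaviour in C. */
int32_t sum_i32(const int32_t *d, size_t n)
{
    int32_t x = 0;
    for (size_t i = 0; i < n; i++) {
        x = x + d[i];
    }
    return x;
}
```

Because each addition needs its own load, a loop like this is limited by load bandwidth as much as by ALU latency.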

On slide 35, you can see the 32-bit integer addition throughput achieved on Nocona and Opteron. Since each operation requires a load and Nocona only supports a single load per cycle, Nocona's performance maxed out at around 1 operation per cycle. Opteron, however, which supports two loads per cycle, was close to the theoretical maximum of 2 operations per cycle. This experiment of course does not take advantage of staggering, but only of the fact that there are two 32-bit simple ALUs.

However, later in the slides, SSE3 is used instead of scalar integer registers. The results for all three processors are shown on slide 44. With SSE3, there will be only one 128-bit load per 4 elements. Nocona can perform a 64-bit load from the L1D per cycle (see the article cited below), while Core can perform a single 128-bit L1D load per cycle. However, Core has a feature called Advanced Digital Media Boost (ADMB) that enables it to perform four 32-bit additions per cycle. That same paper also mentions that pre-Core architectures supported only two 32-bit SSE3 ALU operations per cycle. But if there are two 32-bit staggered ALUs in Nocona, the low SSE3 throughput implies that an SSE3 operation makes use of only one of the staggered ALUs. ADMB can be implemented in two ways. One is to expand each ALU to 64 bits, keep them staggered, and use both ALUs to perform two 64-bit ALU operations per cycle. The other is to expand each ALU to 128 bits and eliminate staggering.

There is a patent filed by Intel in 1998 and granted in 2001 on the staggered execution of an instruction, any instruction basically, not just ALU operations. That patent is still active. There is a lot of discussion there on how staggered execution can be useful for 128-bit SIMD instructions. Based on this patent, it's very possible that Intel Core uses two 64-bit staggered ALUs to achieve its throughput. Each of the 64-bit ALUs can actually be made using two staggered 32-bit ALUs shown in the figure above.

In 2002, Intel filed a patent for a generic staggered ALU design. It was generic in the sense that it was not about any specific ALU operation or the number of clock cycles or the clock period. The interesting thing here is that one of the figures there shows a staggered 64-bit ALU design! That was in 2002. The patent also discusses some of the challenges in designing staggered ALUs.

The patent says that it was both granted and abandoned on the same day in 2006. A few months later, another identical patent application was filed.

This article shows that Potomac (another server-grade Pentium 4) is a 64-bit architecture and supports 4 64-bit operations per cycle. Yamhill and Jayhawk were canceled by Intel. (There is an error in the article: Nocona is a 64-bit CPU.)

(1) In case the link goes down, the paper is titled "The Microarchitecture of the Pentium® 4 Processor" and authored by Glenn Hinton, et al.

(2) Also known as the first-gen Pentium 4.

(3) Also known as staggered ALU.

(4) In case the link goes down, the paper is titled "Low-Voltage Swing Logic Circuits for a Pentium® 4 Processor Integer Core" and authored by Daniel J. Deleganes, et al.

(5) Also known as the third-gen Pentium 4.

(6) In case the link goes down, the paper is titled "A 4GHz 300mW 64b Integer Execution ALU with Dual Supply Voltages in 90nm CMOS" and authored by Sanu K. Mathew, et al.

(7) In case the link goes down, the paper is titled "HIGH-PERFORMANCE ENERGY-EFFICIENT DUAL-SUPPLY ALU DESIGN" and authored by Sanu K. Mathew, et al.

I feel like this answer could use a summary off the top with the actual answer, since even for me it was hard to extract it, and a casual user probably doesn't have much chance. I _think_ the edit changed the answer from "it's unclear" to "yes, it probably did execute back-to-back 64-bit operations in half a cycle, at least asymptotically for long dep chains". — BeeOnRope, Aug 27 '18 at 16:16
@BeeOnRope I think I found a conclusive answer to the question. Although I wonder why double pumping was removed later. — Hadi Brais, Aug 27 '18 at 21:36
Thanks. You say "This means that double-pumping is supported on Nocona, but only if the back-to-back instructions are not dependent" - but to me, from a user point of view, not "double pumping" since the whole difference between double pumping and just having two ALUs is the performance of dependent operations. At least this aligns with what I've heard from people who remember those chips (that the 0.5 cycle latency disappeared with the 64-bit chips). Perhaps internally there is some double pumping going on but it doesn't pay off in reduced latency. — BeeOnRope, Aug 27 '18 at 22:19
@BeeOnRope Yeah I think Nocona basically gives the illusion that there are 4 ALUs on 4 different ports, each with 1 cycle latency, but using double pumping significantly simplifies the design of the pipeline and reduces area overhead compared to actually having 4 ALUs on 4 ports. — Hadi Brais, Aug 27 '18 at 22:25
