What I understand is that there are two types of instruction fusion:

  1. Micro-operation fusion
  2. Macro-operation fusion

Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction".

If several instructions are fused, we obtain a Macro-operation.

If several macro-operations are fused, we obtain Macro-operation fusing.

Am I correct?

melpomene
user366312
    You really need to get familiar with Agner Fog's optimization manuals, especially the [_microarchitecture_](https://www.agner.org/optimize/microarchitecture.pdf) one. Search the document for "Macro-op fusion" and "Micro-op fusion" for the CPU you're interested in. Broadly, the difference is that in macro-op fusion two instructions are fused in one micro-op (e.g. `dec` & `jne` fuse into a single decrement-and-conditional-branch), while micro-op fusion involves handling multiple micro-ops together that really "belong" together, especially for write and read-modify-write instructions. – Iwillnotexist Idonotexist Jun 02 '19 at 09:21

1 Answer

No, fusion is totally separate from how one complex instruction (like cpuid or lock add [mem], eax) can decode to multiple uops. Most instructions decode to a single uop, so that's the normal case in modern x86 CPUs.

The back-end has to keep track of all uops associated with an instruction, whether or not there was any micro-fusion or macro-fusion. When all the uops for a single instruction have retired from the ROB, the instruction has retired. (Interrupts can only be taken at instruction boundaries, so if one is pending, retirement has to find an instruction boundary for that, not in the middle of a multi-uop instruction. Otherwise retire slots can be filled without regard to instruction boundaries, like issue slots.)


Macro-fusion - between instructions

Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop. (Intel and AMD CPUs.) The rest of the pipeline sees it purely as a single uop (footnote 1), except that performance counters still count it as 2 instructions. This saves uop cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, maybe 25%, so choosing to look for this fusion rather than other possible fusions like mov dst,src1 / or dst,src2 makes sense.

Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like add/sub or inc/dec + JCC with some conditions. (x86_64 - Assembly - loop conditions and out of order)
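As a hedged illustration (labels and registers are my own, not from the tested code), a typical countdown loop whose decrement and branch can macro-fuse on Sandybridge-family:

    .loop:
        ; ... loop body ...
        dec   rcx            ; decrement the loop counter
        jnz   .loop          ; dec/jnz can macro-fuse into a single
                             ; decrement-and-branch uop on SnB-family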

Ice Lake (footnote 2) changed to doing macro-fusion right after legacy decode, so pre-decode only has to steer 1 x86 instruction to each of the four decoders.


Micro-fusion - within 1 instruction

Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths)

P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

And Sandybridge family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes
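A sketch of the distinction (NASM syntax; register and scale choices are mine), based on the behavior described above:

    cmp   [rdi], eax          ; load micro-fuses with the cmp: 1 fused-domain
                              ; uop all the way through issue/rename
    cmp   [rdi+rcx*4], eax    ; micro-fuses in the decoders, but the indexed
                              ; addressing mode un-laminates it back to 2
                              ; separate uops before issue on SnB-family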

Both can happen at the same time

    cmp   [rdi], eax
    jnz   .target

Tested on i7-6700k Skylake, probably applicable to most earlier and later Sandybridge-family CPUs, especially before Ice Lake.

The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from [rdi] can micro-fuse with that uop.

Failure to micro-fuse the cmp does not prevent macro-fusion.

The limitation here: RIP-relative + immediate can never micro-fuse, so cmp dword [static_data], 1 / jnz can macro-fuse but not micro-fuse.

A cmp/jcc on SnB-family (like cmp [rdi+rax], edx / jnz) will macro and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused-domain and unfused-domain: load with an indexed addressing mode, and ALU cmp/jnz). You can verify this with perf counters by putting a mov ecx, 1 in between the CMP and JCC vs. after, and note that uops_issued.any:u and uops_executed.thread both go up by 1 per loop iteration because we defeated macro-fusion. And micro-fusion behaved the same.

On Skylake, cmp dword [rdi], 0/jnz can't macro-fuse (it can only micro-fuse). I tested with a loop that contained some dummy mov ecx,1 instructions. Reordering so one of those mov instructions split up the cmp/jcc didn't change perf counters for fused-domain or unfused-domain uops.

But cmp [rdi],eax/jnz does macro- and micro-fuse. Reordering so a mov ecx,1 instruction separates CMP from JNZ does change perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).

cmp [rdi+rax], eax/jne only macro-fuses; not micro. (Well actually micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an RMW-register destination like sub eax, [rdi+rax] that can keep indexed addressing modes micro-fused. That sub with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell).

(The cmp dword [rdi],0 does micro-fuse, though: uops_issued.any:u is 1 lower than uops_executed.thread, and the loop contains no nop or other "eliminated" instructions, or any other memory instructions that could micro-fuse).
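For reference, a microbenchmark along these lines could be structured something like this (a sketch with my own labels, registers, and iteration count, not the exact test code; it assumes [rdi] holds the value in eax so the first branch falls through):

    mov     ebp, 100000000        ; arbitrary iteration count
    .loop:
        cmp     [rdi], eax        ; load micro-fuses with the cmp...
        jnz     .done             ; ...which macro-fuses with jnz (not taken)
        dec     ebp
        jnz     .loop             ; dec/jnz macro-fuse as the loop branch
    .done:

and then measured with the perf events named above (event availability depends on CPU and kernel):

    perf stat -e uops_issued.any,uops_executed.thread,instructions,cycles ./test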

Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
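Concretely, the two code-generation choices look like this (a sketch; which is better depends on fusion rules, register pressure, and whether the loaded value is reused):

    ; memory-operand compare: the load can micro-fuse and cmp/jne can macro-fuse
    cmp     [rdi], eax
    jne     .target

    ; separate load, then register compare-and-branch (the style some
    ; compilers reportedly prefer, per the paragraph above)
    mov     ecx, [rdi]
    cmp     ecx, eax
    jne     .target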


Micro-operations are those operations that can be executed in 1 clock cycle.

Not exactly. They take 1 "slot" in the pipeline, or in the ROB and RS that track them in the out-of-order back-end.

And yes, dispatching a uop to an execution port happens in 1 clock cycle, and simple uops (e.g. integer addition) can complete execution in the same cycle. Since Haswell this can happen for up to 8 uops simultaneously (one per execution port), increased to 10 on Sunny Cove. The actual execution might take more than 1 clock cycle, occupying the execution unit for longer (e.g. FP division).

The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knight's Landing has some not-fully-pipelined SIMD shuffles that are single-uop but have a (reciprocal) throughput of 2 cycles.


Footnote 1 - does a macro-fused uop ever need to split?

If cmp [rdi], eax / jne faults on the memory operand, i.e. a #PF page fault exception, it's taken with the exception return address pointing to the start of the cmp, so it can re-run after the OS pages in the page. That Just Works whether we have fusion or not, nothing surprising.

Or if the branch target address is an unmapped page, a #PF exception will happen after the branch has already executed, from code fetch with an updated RIP.

But if the branch target address is non-canonical, architecturally the jcc itself should #GP fault. e.g. if RIP was near the top of the canonical range, and rel32 = almost +2GiB. (x86-64 is designed so RIP values can literally be 48-bit or 57-bit internally, never needing to hold a non-canonical address, since a fault happens on trying to set it, not waiting until code-fetch from the non-canonical address.)

If CPUs handle that with an exception on the jcc, not the cmp, then sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.

Also, single-stepping with TF=1 should stop after the cmp.

As far as how the cmp/jcc uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.

Surprisingly, the loop instruction (like dec rcx/jnz but without setting flags) is not a single uop on Intel CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?.


Footnote 2: Ice Lake changes

Agner Fog found that macro-fusion happens after the legacy decoders. (Micro-fusion of course still in the decoders so instructions like add eax, [rdi] can still decode in a "simple" decoder.)

Hopefully the upside here is not ending a decode group early if the last instruction is one that could maybe macro-fuse, which is IIRC something earlier CPUs do. (Lower legacy-decode throughput for a big unrolled block of sub instructions vs. or instructions when no JCC is involved. Earlier CPUs couldn't macro-fuse or with anything. This only affected legacy decode, not the uop cache.)

Wikichip incorrectly reports that ICL can only make one macro-fusion per clock cycle, but testing on Can two fuseable pairs be decoded in the same clock cycle? confirms Rocket Lake (same uarch backported to 14nm) can still do 2/clock like Haswell and Skylake.

One source reports that Ice Lake can't macro-fuse inc or dec/jcc (or any instruction with a memory operand), but Agner Fog's table disagrees. uiCA shows dec/jnz at the bottom of a loop macro-fusing, and their paper shows its predictions agree well with testing on real CPUs, including ICL. But if they compiled with recent GCC, they might not have tested any dec/jcc loops, only sub/jcc. Agner's ICL fusion table isn't a copy/paste of earlier SnB; it shows inc/dec can now fuse in the same cases as add/sub (which surprisingly includes jc/ja, even though dec doesn't modify CF). If anyone could test this to verify, that'd be great.

update: Noah's testing on a Tiger Lake shows dec/jnz at the bottom of a loop can macro-fuse. And that dec/jc doesn't appear to macro-fuse.

Microcode version: 0x42. decl; jnz loop still macrofuses (niters = nissued_uops = nexecuted_uops = cycles = {expected_ports}).

Couldn't get decl; jc to macrofuse. For decl; jc I set up two loops: subl $1, %ecx; decl %eax; jnc loop (where ecx was a loop counter; jnc corrected from jc per Noah's follow-up comment). niters * 3 uops issued/executed.
Also tried just carry-flag unset and decl %eax; jc done; jnz loop, also 3 * niters uops.

It's likely that Ice Lake behaves the same as Tiger Lake, since it didn't make major microarchitectural changes.

Peter Cordes
  • @Hadi: I simplified your edit. I don't think an exception can ever be taken with the exception-return address pointing to the JCC. Or if it can, that's a very special case that can be handled specially. Anyway, I did some testing that I'd been meaning to write up, and on SKL `cmp dword [rdi], 0 / jnz` can't *macro*-fuse, I guess because of the immediate operand. – Peter Cordes Jun 02 '19 at 20:03
  • "The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion." --- Do you have any pointers about the way the retirement stage figures out that all the uops for a single instruction have retired? – ricpacca Jul 11 '19 at 01:06
  • @ricpacca: not exactly. A good mental model is that the ROB is a circular buffer, written in order by the issue stage, and read in order by the retirement stage. Each entry (a single uop, possibly micro-fused) has a flag that indicates whether it's completed (ready to retire) or not. And I guess also a "start of new instruction" flag, or a RIP field that changes for a new instruction, or whatever. The exact mechanism isn't relevant for performance; the retirement stage just retires uops in groups of 4 or 8 per thread per cycle, or whatever the retirement bandwidth is. – Peter Cordes Jul 11 '19 at 03:55
  • There are plenty of FP instructions that are not fully pipelined on modern Intel, in the sense of accepting a new operation every cycle. – BeeOnRope Jul 12 '19 at 03:43
  • @BeeOnRope: I said execution units (uops), not instructions. On SKX, the only not-fully-pipelined single-uop instructions I see on Agner Fog's table are div/sqrt (on the divider execution unit), and also strangely `VFIXUPIMMPS/PD`. The other instructions with worse than 1c throughput are all multi-uop. – Peter Cordes Jul 12 '19 at 04:00
  • Well, we can mostly only guess at what EUs exist by looking at port usage and applying some reasoning, but sure, let's talk about uops. So I am talking about things like `fsqrt` (first one I came across) that is a single uop yet has throughput of 4-7 cycles. Note that multi-uop instructions can also be shown to not be pipelined: if something takes 3 uops but 10 cycles throughput you can be sure some part is not pipelined. – BeeOnRope Jul 12 '19 at 20:48
  • @BeeOnRope: div/sqrt instructions all run on the same divider execution unit on port 0, which is special in not being fully pipelined. That's why IACA has a special footnote for div/sqrt instructions vs. other uops for other ports. (Agreed about instructions where the throughput is worse than the number of uops / ports_involved; there might be some I overlooked. Integer shift instructions like `shl r32, cl` are falsely listed at 2c throughput by Agner because he didn't break the flag dependency. You can speed up shifts by interleaving an xor-zeroing to break the FLAGS dep.) – Peter Cordes Jul 13 '19 at 02:36
  • @PeterCordes - how do you know FP `sqrt` runs on the same unit as integer divide? Some of the `p0` FP stuff is all 1 uop, but yeah the latency and throughput seems similar. Do you think the other uops are just adjustments specific to integer divide that don't need to happen for the FP stuff? – BeeOnRope Jul 13 '19 at 17:43
  • @BeeOnRope: Intel calls it the div/sqrt unit. `arith.divider_active` - "Cycles when divide unit is busy executing divide or square root operations". A throughput bottleneck on `sqrtss` keeps firing that event basically every cycle. Divide and square root are both calculated with a similar iterative process which is why they can usefully share an execution unit, and why they have very similar performance characteristics running on that EU. Combined div/sqrt units are normal: [How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?](//stackoverflow.com/q/54642663) – Peter Cordes Jul 13 '19 at 17:57
  • Thanks Peter, got it! – BeeOnRope Jul 13 '19 at 20:17
  • I think you are wrong / inexact in the part where you said "Micro-fusion stores 2 uops **from the same instruction**", because Haswell (for example) can decode up to 5 instructions per cycle while it can only produce up to 4 fused-uops per cycle (so it would be a bottleneck if what you said in that line were true). And a common pair of instructions is "cmp and JCC": they are 2 instructions but they are afterwards fused into 1 fused-uop. – isma Jul 25 '20 at 09:56
  • @isma: That's *macro*-fusion. Micro-fusion is the other one. Read my answer, explaining that difference is literally the point of it, and the first bolded sentence. – Peter Cordes Jul 25 '20 at 09:59
  • But this is contradicting Patterson's concept of fused-uops! Read http://www.lighterra.com/papers/modernmicroprocessors/ from the point that says "processors often group or "fuse" μops into common pairs where possible for ease of tracking ...". He is giving the same examples I gave you in my last comment and is talking about it as a "uop", not as a "MOP". – isma Jul 25 '20 at 10:10
  • @isma: It's not contradicting anything. Intel has 2 kinds of fusion, macro (multiple instructions together), and micro (multi-uop instructions go through the front-end more efficiently). AMD for example doesn't have micro-fusion; memory-source ALU instructions are always a single uop. (Or no extra, if the ALU instruction was already multi-uop). – Peter Cordes Jul 25 '20 at 10:19
  • @isma: Nothing says that Intel has to use Patterson's terminology. But in fact it doesn't even conflict, micro-fusion is just one category. The "micro" in "micro-fusion" is unrelated to Intel micro-ops (uops) vs. AMD macro-ops (mop); that's just AMD bragging about getting more work done with each op (especially in the early days before Intel CPUs could do any micro-fusion, so Intel CPUs did decode `add eax, [edi]` to more ops than AMD, and same for stores.) BTW, a RISC ISA doesn't need micro-fusion: every single instruction is simple enough to be 1 op, so Patterson might not mention it. – Peter Cordes Jul 25 '20 at 10:20
  • TL:DR: you're inventing a non-problem because you're getting hung up about some terminology that's specific to Intel CPUs. – Peter Cordes Jul 25 '20 at 10:24
  • But Patterson is giving cmp-jcc as an example of fused-uop (**micro**) and you are giving cmp-jcc as an example of fused-MOP (**Macro**) !! – isma Jul 25 '20 at 10:24
  • I have just seen your new comment (was published when I was writing last mine), so ignore my last comment because it has already been answered. It's just a conflict on terminology I see (Patterson uses one terminology and Intel another one a little ""different"", and you were talking about Intel's one). Thanks – isma Jul 25 '20 at 10:31
  • @isma Note that maybe you're still not understanding my 2nd point about terminology. It's *not* a conflict. Macro-*fusion* makes 1 uop out of 2 instructions on Intel CPUs. The normal or max size of an op (micro vs. macro) is totally unrelated to the type of fusion. In Intel's terminology, macro-fusion means decoding cmp/jcc as 1 uop. In AMD Bulldozer terminology, just plain "fusion" is decoding cmp/jcc as 1 m-op. In AMD Zen terminology, just plain "fusion" is decoding cmp/jcc as 1 uop. Also yes, since this is an x86 question specifically about Intel CPUs, the answer uses Intel terminology. – Peter Cordes Jul 25 '20 at 10:44
  • definitely what makes me not understand is that, for example, you say "Macro-fusion makes 1 uop out of 2 instructions on Intel CPUs.". In my head, macro-fusion makes 1 **MOP** out of 2 instructions. And, after, those are converted uops (which also are fused-uops). I mean, I thought that what Intel does is: create 1 MOP from 1 or 2 instructions. And, after, create uop-fusions from the MOPs. Each fused-uop is coming from the same MOP (but not instruction. Since a MOP can be 2 instructions). What is exactly what I have wrong in my concepts about uop and MOP? – isma Jul 25 '20 at 10:56
  • @isma: Then in your head you're inventing a meaning for "macro-fusion" that isn't how Intel defines the term. Technical terminology means what its technical definition says, regardless of the English meaning of the words. But even then, I don't see a problem. The "macro" is just referring to across-instructions (large scale), as opposed to within one instruction. Talking about what's getting fused, not the size of the thing they're fused into. – Peter Cordes Jul 25 '20 at 11:37
  • Intel CPUs don't have MOPs or anything like you describe. The decoders make all micro + macro fusions at once. The only 2nd step is *expanding* micro-fused uops in the unfused domain of the back-end (scheduler (aka RS) + execution units). – Peter Cordes Jul 25 '20 at 11:40
  • Wow Peter, do you know what was creating my confusion? The wikichip's block diagrams!! (for example Haswell one). That made me think that instr -> MOP -> uop were done in different stages. But thanks to another person with the same doubt (https://community.intel.com/t5/Software-Tuning-Performance/Macro-fusion-merges-two-instructions-into-a-single-micro-op/td-p/1139690) and thanks to your answer I've finally understood the concept! – isma Jul 25 '20 at 11:48
  • Wow, https://en.wikichip.org/w/images/c/c7/haswell_block_diagram.svg looks weird to me. I've never seen anyone call the un-decoded chunks of x86 machine code "MOP"s. It's actually plausible that macro-fusion does happen in length-finding pre-decode, though, so it can send the 2-instruction sequence to a single decoder as one instruction, and send later instructions to later decoders without leaving a gap. I'd guess that's accurate. I normally refer to David Kanter's diagrams, e.g. https://www.realworldtech.com/haswell-cpu/2/ shows "6 instructions" going from fetch/pre-decode to decode. – Peter Cordes Jul 25 '20 at 11:57
  • Only one last thing: when you talked about macro-fusion creates 1 uop from 2 instructions (for example test + JCC), that uop isn't considered to be as a fused one, isn't it? I mean, in the RS won't be needed to defuse that uop in 2 uops, right? – isma Jul 25 '20 at 12:03
  • @isma: Right, unlike micro-fusion, macro-fusion doesn't re-expand later. It's a plain single uop. Pretty sure my answer here already says this; if not let me know. – Peter Cordes Jul 25 '20 at 12:05
  • There is a new thing that intrigues me. uops in Intel have a fixed length, but that length value is not published. Let's supose 6Bytes. A fused-uop, like you said, will occupy only one entry in the ROB, but that fused-uop will occupy 6Bytes? Or will it occupy 12Bytes? I mean, does the micro-fusion merge 2 uops so that it creates a uop which is twice the length of an unfused-uop? Or are they merged so that that fused-uop is the same length as an unfused-uop? – isma Jul 29 '20 at 10:05
  • @isma: Why would you assume something as tiny as 6 bytes? That couldn't even hold the 8-byte immediate of a `mov rax, 0x123456789abcd`, let alone the opcode and register number. Internal uops are known to be fairly big, although Sandybridge streamlined them some especially in the back-end (with un-lamination of indexed addressing modes as a downside). But anyway, it probably makes sense to think of a uop as a `union{}` where the same bits could mean different things depending on some earlier type bits. Some simple uops might not truly use as many bits, but they still take fixed-width slots. – Peter Cordes Jul 29 '20 at 14:15
  • @isma: The number of bits needed for a uop certainly doesn't double with micro-fusion or macro-fusion; if that were the case then it would be a pretty inefficient internal design, making every ROB entry twice as big as it could have been. More likely it just takes a few extra bits to indicate that *both* parts of a uop are used, both ALU operation and addressing mode (for memory-source ALU). So maybe there are some fixed bit-positions within uops, not fully a union. The uop format doesn't have to be fixed between uop-cache vs. ROB; some decoding from uop cache is ok, but not ROB. – Peter Cordes Jul 29 '20 at 14:20
  • Yes, that seems logical, but I disagree with you on one little thing: in the uop cache there may be fused-uops and "simple" uops, right? So, with that in mind, I think both of them will occupy the same number of bytes. I mean, I think there is only one fixed size of uop which is used for both the uop cache and the ROB entries (since in both places there may be a fused-uop or a simple one). If it is "a simple uop" I think some bits will be wasted. And I think that, after, the RS receives a tinier uop size, where all the uops are "simple" (unfused) uops. What do you think about this? – isma Jul 29 '20 at 14:31
  • @isma: only up-to-6 uop-cache entries are read every clock cycle, by one block of logic which can expand them into a larger but simpler format if necessary. It's a large cache (~1536 entries), vs. a 224 entry ROB which also need bits for completed or not, and to associate with unfused-domain uops. We already know that a uop with more than 32 bits of immediate + displacement data can borrow some space from other uops in the same uop-cache line, but of course nothing like that would happen in the ROB. (Agner Fog's microarch pdf https://agner.org/optimize/ Sandybridge section.) – Peter Cordes Jul 29 '20 at 14:38
  • @isma: Also, in the front-end, register numbers are architectural regs (rax..r15). In the back-end, reg numbers are physical regs. There's very good reason to expect that the actual bit layout of a uop is simpler but perhaps larger in the ROB than in the uop cache. – Peter Cordes Jul 29 '20 at 14:39
  • @isma: Yes, the RS probably has a different format, too; it has to track which execution port it's been allocated to (in the issue/rename/alloc stage), and doesn't have to track micro-fusion. It does have to track which ROB entry this unfused-domain uop is associated with, and maybe something about where its inputs are coming from so the scheduler can scan to detect when all are ready. It still also has to be able to hold a `mov reg, imm64` so it can't be too small. – Peter Cordes Jul 29 '20 at 14:42
  • It seems very logical what you said. I completely agree. Above all, I agree that a fused-uop does not occupy twice as much as a "simple" uop. Mainly I wanted to see what you thought about a fused-uop occupying the double as a "simpler uop" since maybe you did see some reason for it to occupy double (I didn't see any), but I'm happy to know that we both agreed that this would be a big waste Bytes. Completely agree. Thanks for your opinion. – isma Jul 29 '20 at 14:52
  • @PeterCordes Can only 1 microfused instruction be decoded in a cycle (i.e 2-1-1-1) or are they split earlier? – Noah Dec 16 '21 at 00:06
  • @Noah: I'm pretty sure they're not split until issue/rename into the RS, and my test-case in https://www.agner.org/optimize/blog/read.php?i=415#857 achieves 7 unfused-domain uops per clock with 3 micro-fused uops. That's running from the DSB. (Or maybe LSD, given that was in 2017). Manual unroll or the JCC erratum on SKL could test whether MITE can do that, but I'd be surprised if there was any limitation. I think the "simple" decoders can just produce 1 micro-fused uop for [any 1-uop instruction the simple decoders can handle at all](https://stackoverflow.com/q/61980149/224132). – Peter Cordes Dec 16 '21 at 00:26
  • @PeterCordes so 1-uop for the decoders is in the fused-context? – Noah Dec 16 '21 at 01:18
  • @Noah: Yeah, everything is fused-domain until the RS and execution units. (fused/unfused domain refers to micro-fusion, not macro-fusion; macro-fusion is a separate thing and happens as instructions are routed to decoders in pre-Ice Lake, or apparently after decode in Ice Lake. Although IIRC there is some interaction between fusion if the cmp/test has an immediate and/or RIP-relative or something.) – Peter Cordes Dec 16 '21 at 01:21
  • @Noah: If you have an Ice Lake and some spare time, can you test if `dec`/`jnz` still macro-fuse? I came across a claim (https://www.corsix.org/content/x86-macro-op-fusion-notes) that ICL doesn't macro-fuse inc/dec, unlike SKL. Also claims it removes fusion for instructions with memory operands. (See updates to my answer. Agner Fog and uiCA both say `dec/jnz` loops can still macro-fuse. Also, Agner Fog says `dec`/`jc` can surprisingly macro-fuse, leaving CF unmodified but branching on it.) Hmm, I wonder if a microcode update could have changed it between 2019 and 2021? – Peter Cordes Jun 20 '23 at 05:46
  • @PeterCordes Only have [TGL](https://www.intel.com/content/www/us/en/products/sku/213799/intel-core-i711850h-processor-24m-cache-up-to-4-80-ghz/specifications.html) at the moment. Microcode version: `0x42`. `decl; jnz` loop still macrofuses (niters = nissued_uops = nexecuted_uops = cycles = {expected_ports}). Couldn't get `decl; jc` to macrofuse. For `decl; jc` I setup two loops: `subl $1; %ecx; decl %eax; jc loop` (where `ecx` was a loop counter). See niters * 3 uops issued/executed. Also tried just carry-flag unset and `decl %eax; jc done; jnz loop`, also 3 * niters uops. – Noah Jun 21 '23 at 16:37
  • Thanks. Sounds like https://www.corsix.org/content/x86-macro-op-fusion-notes is probably wrong, unless it was something fixed in TGL. – Peter Cordes Jun 21 '23 at 16:57
  • Not only a tgl fix, but also that icl/icx was affected by a recent microcode update. Otherwise i strongly suspect the article is incorrect. – Noah Jun 22 '23 at 06:55
  • @PeterCordes for the sake of completeness, I also did the two carry test loops above but with inverted carry: `jnc` instead of `jc`. Still see no macro-fusion. Also tested that `decl` macro-fuses with the sign flag. Also note above: `subl $1; %ecx; decl %eax; **jc** loop` is backwards, it was `subl $1; %ecx; decl %eax; **jnc** loop`. – Noah Jun 22 '23 at 17:17