
Agner Fog finds that the x86 bit-manipulation instructions (btr/bts/btc, without lock) applied to a memory operand are slower than other read-modify-write instructions (like add, xor, etc.) on most processors where they are supported. Why is this? The instructions seem quite straightforward to implement.

Is it because the address actually loaded from is not the same as that specified by the memory operand, and this confuses some frontend mechanism for tracking memory accesses? This seems plausible, but I wouldn't expect it to affect throughput (at least, not by so much); only latency.

asked by Moonchild

1 Answer


Is it because the address actually loaded from is not the same as that specified by the memory operand

Yes, pretty clearly that's the thing that separates it from a memory-destination shift.

The reg-reg version is 1 uop with 1-cycle latency on Intel, running on execution port 0 or 6 on Haswell and later, for example, same as shifts. (Decoding an index to a 1-hot mask is cheaper than a general shifter, but since there are shift units anyway, presumably Intel just uses those.)

AMD for some reason runs bts reg,reg as 2 uops, slower than simple shifts. IDK why, maybe something about the FLAGS setting.

bts mem, imm8 is also pretty normal, 3 front-end uops on Intel. xor mem, imm8 is only 2 front-end uops, but that's because it can micro-fuse the load+xor. not mem is 3 front-end uops, only micro-fusing the store-address and store-data uops.

and this confuses some frontend mechanism for tracking memory accesses?

No. The front-end doesn't track memory accesses; that's the back end's job.

It's partly slow because it's implemented as multiple uops; that hurts even when you do one bts surrounded by different instructions. On Intel Haswell and Alder Lake (and probably everything in between), it's 10 front-end uops for bts mem, r32, vs. 3 for bts mem, imm8.

Since it can't use the usual address-generation hardware directly, it's implemented in microcode as multiple uops, presumably something like LEA into a temporary from the normal addressing mode, then adding (bit_index>>5) * 4 to that to index by dwords, or something like that. Oh, maybe the reason it's 10 uops is that it always wants to access the aligned dword containing the bit, not just a multiple-of-4 offset from the address in the [] addressing mode for something like [rax + rdx*4 + 123].
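
To make that arithmetic concrete, here's a rough spelled-out equivalent of bts dword ptr [rax + rdx*4 + 123], ecx using ordinary instructions (just an illustration of the work involved, not the actual uop sequence, which isn't publicly documented; r8/r9/r10d are scratch registers picked for the example):

    lea    r8, [rax + rdx*4 + 123]   ; resolve the normal addressing mode into a temporary
    movsxd r9, ecx                   ; the bit offset in a register is signed
    sar    r9, 5                     ; dword index = bit_offset >> 5 (arithmetic shift keeps the sign)
    mov    r10d, [r8 + r9*4]         ; load the dword containing the bit
    bts    r10d, ecx                 ; reg,reg form masks the index to the low 5 bits
    mov    [r8 + r9*4], r10d         ; store it back

Note this spelled-out version doesn't force the access to the aligned dword containing the bit; whether the real microcode does is exactly the uncertainty above.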

Doing it manually is more efficient for the normal case where you know the start of the bitstring is aligned, so you can shr the bit-index to get a dword index for a load / bts reg,reg (1 uop) / store sequence. That takes fewer uops than bts [mem], reg. Note that bts reg,reg truncates / wraps the bit-index, so if you arrange things correctly that modulo comes for free; for example, in a Sieve of Eratosthenes. See also How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
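
A minimal sketch of that manual version, assuming the bit-array at rdi is dword-aligned and the bit index in esi is known to be non-negative (register choices are arbitrary):

    mov    eax, esi                ; copy the bit index
    shr    eax, 5                  ; dword index = bit_index / 32
    mov    edx, [rdi + rax*4]      ; load the containing dword
    bts    edx, esi                ; 1 uop; the index is wrapped mod 32 for free
    mov    [rdi + rax*4], edx      ; store it back

That's roughly 5 front-end uops (the store-address and store-data uops micro-fuse), none of them microcoded, versus 10 for bts [mem], r32.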


But Agner Fog and https://uops.info/ both measure a throughput of one bts mem, reg per 5 cycles on Haswell / Alder Lake P-cores, significantly worse than the front-end bottleneck would account for (10 uops through Haswell's 4-wide issue stage would be only 2.5 cycles), or than any per-port back-end bottleneck.

I don't know what accounts for that. The actual load and store uops should be normal: their inputs come from internal temporary registers, but as far as the addresses in the store buffer and load buffer are concerned (together, what Intel calls the memory order buffer, MOB), they're still an ordinary load uop and store uop.

I don't expect it to be a special case of memory-dependency prediction, since that happens when a load uop executes while there are earlier store-address uops not yet executed, i.e. the addresses of some previous stores are still unknown.

TODO: run some experiments to see which other instructions, if any, mixed in with bts mem,reg will slow it down, competing for whatever resource it bottlenecks on.
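
For example, one might time a loop like this under perf stat (an untested sketch; swap the filler instructions for other candidates such as extra loads, extra stores, or shifts, and see which ones push the cost per bts above about 5 cycles):

    mov     ecx, 100000000              ; iteration count
    test_loop:
        bts     qword ptr [r14], r8     ; instruction under test
        imul    rax, rax, 3             ; filler: independent ALU work (swap for other candidates)
        imul    rbx, rbx, 3
        dec     ecx
        jnz     test_loop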

It doesn't look like a benchmarking error on the part of https://uops.info/ (e.g. using the same address every time and stalling on store-forwarding latency): their testing included some unrolled sequences using different offsets. For example, Haswell throughput testing for bts m64, r64 measured 6.02 or 6.0 cycle throughput with the same address every time (bts qword ptr [r14], r8), or an average of 5.0 cycles per BTS when unrolling a repeated sequence like bts [r14],r8 ; bts [r14+0x8],r8 ; ... ; bts [r14+0x38],r8. Even for a sequence of 16 independent instructions covering two adjacent cache lines, it was still the same 5 cycles per BTS.
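
Spelled out, that unrolled sequence (offsets filled in from the 0x0..0x38 pattern quoted above; uops.info's surrounding measurement loop is not shown) is eight bts instructions with distinct qword offsets:

    bts qword ptr [r14],      r8
    bts qword ptr [r14+0x08], r8
    bts qword ptr [r14+0x10], r8
    bts qword ptr [r14+0x18], r8
    bts qword ptr [r14+0x20], r8
    bts qword ptr [r14+0x28], r8
    bts qword ptr [r14+0x30], r8
    bts qword ptr [r14+0x38], r8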

answered by Peter Cordes
  • 'The front-end doesn't track memory accesses, that's the back end' -- I thought memory disambiguation prediction was handled there? – Moonchild Jan 28 '23 at 10:04
  • Why would it do a dword access rather than a byte access? That gets rid of all the complication with alignment (or, rather, shifts it to the memory subsystem, where it needed to be handled anyway). I would expect, at most, 2 uops to calculate the address, and then 1 each to load, operate, and store. – Moonchild Jan 28 '23 at 10:07
  • @Moonchild: The memory-order buffer (load+store buffers) and uop scheduling are all in the back-end. The front-end is the in-order part before uops issue into the ROB + RS (aka scheduler). (https://www.realworldtech.com/haswell-cpu/6/) That happens regardless of whether a load is dependent on a previous store or not (and yes, that can be dynamically predicted if the addresses of older stores aren't all known; that happens in a load execution unit, part of the back end). – Peter Cordes Jan 28 '23 at 16:14
  • @Moonchild: Good question about the true width of the memory access. It would be an interesting experiment to do byte stores near the byte containing the bit and look for store-forwarding stalls. The manual for `bt` (https://www.felixcloutier.com/x86/bt) says "When accessing a bit in memory, the processor may access 4 bytes starting from the memory address for a 32-bit operand size", and "It may do so even when only a single byte needs to be accessed to reach the given bit" (which is always). Also, I was wrong: it doesn't say it accesses an aligned word or dword, and it warns about using these instructions near address-space holes. – Peter Cordes Jan 28 '23 at 16:21
  • @Moonchild: A possible reason for using a dword access is that the ALU may only implement 16/32/64-bit operand-size for this operation, e.g. masking the shift-count to 4 or 5 bits within a word or dword instead of truncating to the low 3 within a byte. – Peter Cordes Jan 28 '23 at 16:23
  • I mean, I can imagine all sorts of ways to implement it slowly; but _why_ did they implement it thus? It certainly doesn't seem prohibitive to also support 8-bit operands internally. – Moonchild Jan 29 '23 at 06:31
  • @Moonchild: Good question. Back in the early 90s when P6 was new, the RISC philosophy was strong with some of the designers, including Andy Glew by his own account. "Let software or microcode handle it" was their attitude to several things. Keep in mind that having a dedicated AGU for bitstring instructions would require muxing to select that or a regular AGU, so it would make critical paths longer (in gate delays). And before Haswell, no uop could have more than 2 inputs, so it would have to at least LEA first for indexed addressing modes. My guess is the design's never been revisited. – Peter Cordes Jan 29 '23 at 07:20