> Is it because the address actually loaded from is not the same as that specified by the memory operand?
Yes, pretty clearly that's the thing that separates it from a memory-destination shift.
The reg-reg version is 1 uop with 1 cycle latency on Intel, running on execution ports 0 or 6 on Haswell and later, for example, same as shifts. (Decoding an index to a 1-hot mask is cheaper than a general shifter, but since there are shift units anyway, presumably Intel just uses those.)
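For illustration, `bts eax, ecx` computes roughly the following (this is just the equivalent operation expressed with a shift, not a claim about Intel's actual internal implementation; `edx` is an arbitrary scratch register):

```asm
; bts eax, ecx  is roughly:
mov  edx, 1
shl  edx, cl       ; decode the bit index (mod 32) into a one-hot mask
or   eax, edx      ; set that bit in the destination
; (real BTS additionally copies the old value of that bit into CF)
```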
AMD for some reason runs `bts reg,reg` as 2 uops, slower than simple shifts. IDK why, maybe something about the FLAGS setting.
`bts mem, imm8` is also pretty normal, 3 front-end uops on Intel. `xor mem, imm8` is only 2 front-end uops, but that's because it can micro-fuse the load+xor. `not mem` is 3 front-end uops, only micro-fusing the store-address and store-data uops.
> and this confuses some frontend mechanism for tracking memory accesses?
No. The front-end doesn't track memory accesses; that's the back end.
It's partly slow because it's implemented as multiple uops; that hurts even when you do one surrounded by different instructions. On Intel Haswell and Alder Lake (and probably everything in between), it's 10 front-end uops for `bts mem, r32`, vs. 3 for `bts mem, imm8`.
Since it can't use the usual address-generation hardware directly, it's implemented in microcode as multiple uops, presumably something like LEA into a temporary register from the normal addressing mode, then adding `(bit_index>>5) * 4` to that to index by dwords, or something like that. Oh, maybe the reason it's 10 uops is that it always wants to access the aligned dword containing the bit, not just a dword at some multiple-of-4 offset from the address computed by the `[]` addressing mode, for something like `[rax + rdx*4 + 123]`.
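As a rough guess (purely speculation, not documented microcode), the expansion of `bts dword [rax + rdx*4 + 123], ecx` might look something like this, using made-up registers (`rsi`, `rdi`, `r8d`) to stand in for internal temporaries and ignoring negative bit offsets:

```asm
; hypothetical expansion of:  bts dword [rax + rdx*4 + 123], ecx
lea   rsi, [rax + rdx*4 + 123]   ; resolve the original addressing mode once
mov   edi, ecx                   ; copy the bit index (negative-index handling omitted)
shr   edi, 5                     ; dword index = bit_index / 32
mov   r8d, [rsi + rdi*4]         ; load the dword containing the bit
bts   r8d, ecx                   ; reg-reg BTS wraps the index mod 32, sets CF
mov   [rsi + rdi*4], r8d         ; store it back
```

That's only about 6 uops even before micro-fusion, so if the real microcode is 10 uops it's presumably doing extra work this sketch skips, like sign-handling of the bit offset or rounding the final address down to an aligned dword.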
Doing it manually is more efficient for the normal case where you know the start of the bitstring is aligned, so you can `shr` the bit-index to get a dword index for a load / `bts reg,reg` (1 uop) / store sequence, as sketched below. That takes fewer uops than `bts [mem], reg`. Note that `bts reg,reg` truncates / wraps the bit-index, so if you arrange things correctly that modulo comes for free, for example in a Sieve of Eratosthenes. See also: How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
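A minimal sketch of that manual version, assuming the bitmap base in `rdi` is at least 4-byte aligned and the bit index is in `esi` (register choices are arbitrary):

```asm
; set bit number esi in the bitmap at [rdi]
mov   eax, esi
shr   eax, 5               ; dword index = bit_index / 32
mov   edx, [rdi + rax*4]   ; load the containing dword
bts   edx, esi             ; reg-reg BTS uses bit_index % 32 for free; CF = old bit
mov   [rdi + rax*4], edx   ; store it back
```

That's roughly 5 front-end uops (the `mov eax, esi` needs no execution unit thanks to mov-elimination), vs. 10 for the memory-destination form.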
But Agner Fog and https://uops.info/ both measure a throughput of one per 5 cycles on Haswell / Alder Lake P-cores, significantly worse than the front-end bottleneck (10 uops at 4 per clock on Haswell would be only 2.5 cycles per BTS) or any per-port back-end bottleneck would account for.
I don't know what accounts for that. The actual load and store uops should just be normal, with inputs coming from internal temporary registers, but still a normal load uop and store uop as far as the addresses in the store buffer and load buffer are concerned. (Together, Intel calls those the memory order buffer = MOB.)
I don't expect it to be a special case of memory-dependency prediction, since that happens when a load uop executes while there are earlier store-address uops not yet executed (so the addresses of some previous stores are still unknown).
TODO: run some experiments to see what, if any, other instructions mixed in with `bts mem,reg` will slow it down, competing for whatever resource it bottlenecks on; a possible test loop is sketched below.
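For example, something along these lines (a hypothetical NASM test loop, not a finished benchmark harness; `r14` is assumed to point at writable scratch memory, and the filler instruction and `%rep` count are the knobs to vary between runs, timing with `perf stat` or similar):

```asm
    mov     ecx, 100000000       ; iteration count
bench_loop:
    bts     qword [r14], r8      ; memory-destination BTS under test
%rep 4
    imul    eax, eax             ; filler: competes for integer ALU ports
%endrep
    dec     ecx
    jnz     bench_loop
```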
It doesn't look like a benchmarking error on the part of https://uops.info/ (e.g. using the same address every time and stalling on store-forwarding latency). Their testing included some unrolled sequences using different offsets. For example, Haswell throughput testing for `bts m64, r64` measured 6.02 or 6.0 cycle throughput with the same address every time (`bts qword ptr [r14], r8`), or an average of 5.0 cycles per BTS when unrolling a repeated sequence like `bts [r14],r8` ; `bts [r14+0x8],r8` ; ... ; `bts [r14+0x38],r8`. Even for a sequence of 16 independent instructions covering two adjacent cache lines, it was still the same 5 cycles per iteration.