Essentially yes, but there are a lot of caveats, some with severe performance penalties that make atomics usage questionable in many cases (though still useful where other methods of synchronization would cost even more).
The PCIe standard describes essentially three scenarios in which a host CPU/Root Port is involved (though it does not really discuss the scenarios, just the groupings), so I will fill in the blanks a bit and color between the lines here, and compare/contrast that with the x86 Intel instruction set support. (I do not know what AMD's current situation is, but I would GUESS it to be similar; AMD may have done more or less.)
PCIe: The Atomic Operations defined by the PCIe standard indirectly define three cases:
- CPU to Device Atomics. (The memory is in the Device, accessed by the CPU.)
- Device to CPU/Memory Atomics. (The memory is in the Host/CPU, accessed by the Device.)
- Device to Device Atomics. (The memory is in one Device, accessed by another.)
Those three cases are treated as "obvious" by the standard, but they are not really so obvious. The PCIe standard then defines three subsets of support for Root Ports or Endpoints:
a. Requester Support. (To initiate Atomic Operations).
b. Completer Support. (To support reception and carrying out of Requested PCIe Atomics)
c. Forwarding Support. (To support forwarding Atomics from one Root Port to a different Root Port, in the role of handling Device to Device Atomics where the two devices are not under the same Root Port. This also applies to forwarding in PCIe Switches, which must forward upstream or downstream even within a single Root Port's tree when a PCIe switch sits between the two devices.)
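For concreteness, here is where those three subsets show up in configuration space (a sketch: the bit positions come from the PCIe spec's Device Capabilities 2 / Device Control 2 registers in the PCI Express Capability structure, and the names mirror Linux's <linux/pci_regs.h>):

```c
/* AtomicOp-related bits, per the PCIe spec.  Names mirror <linux/pci_regs.h>. */

/* Device Capabilities 2 register */
#define PCI_EXP_DEVCAP2_ATOMIC_ROUTE    0x00000040 /* (c) AtomicOp routing (forwarding) supported */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP32   0x00000080 /* (b) 32-bit AtomicOp Completer supported     */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP64   0x00000100 /* (b) 64-bit AtomicOp Completer supported     */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP128  0x00000200 /* (b) 128-bit CAS Completer supported         */

/* Device Control 2 register */
#define PCI_EXP_DEVCTL2_ATOMIC_REQ      0x0040     /* (a) AtomicOp Requester ENABLE: a control bit
                                                      only; there is no Requester capability bit  */
```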
TO ACCOMPLISH:
Device to HOST operations:
- The Device must support the REQUESTER Capability for Atomics (and the specific ops).
- The Host must support the COMPLETER Capability for Atomics (and the ops).
- Any PCIe switches involved must support the FORWARDING Capability. (Some switches forward without advertising the Forwarding capability, but then you cannot tell programmatically whether the switch really works.)
HOST to Device operations (The OP's question.)
- The HOST must support REQUESTER Capability for Atomics (and be so enabled)
- The Device must support COMPLETER Capability for Atomics (and be so enabled)
- Any switches should report the FORWARDING Capability, but some do forward without the Capability being advertised.
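To check that chain on Linux, you can read DevCap2/DevCtl2 of the endpoint and of its Root Port straight out of sysfs config space. A minimal sketch (run as root; the BDF path is only an example, and a little-endian host is assumed):

```c
/* Walk a function's capability list, find the PCI Express capability, and
 * dump the AtomicOp bits from DevCap2/DevCtl2.  Run it once against the
 * endpoint (Completer support) and once against its Root Port (Routing and
 * the Requester Enable control bit).  Register offsets are from the PCIe
 * spec; little-endian host assumed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint32_t cfg_read(int fd, off_t off, int len)
{
    uint32_t v = 0;
    if (pread(fd, &v, len, off) != len) { perror("pread"); exit(1); }
    return v;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1]
                                : "/sys/bus/pci/devices/0000:01:00.0/config";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return 1; }

    /* Standard capability list: pointer at 0x34, PCI Express cap ID is 0x10. */
    uint8_t ptr = cfg_read(fd, 0x34, 1) & 0xFC;
    while (ptr) {
        if ((cfg_read(fd, ptr, 1) & 0xFF) == 0x10)
            break;
        ptr = cfg_read(fd, ptr + 1, 1) & 0xFC;
    }
    if (!ptr) { fprintf(stderr, "no PCI Express capability\n"); return 1; }

    uint32_t devcap2 = cfg_read(fd, ptr + 0x24, 4);   /* Device Capabilities 2 */
    uint16_t devctl2 = cfg_read(fd, ptr + 0x28, 2);   /* Device Control 2      */

    printf("AtomicOp Completer 32/64/128-bit : %u/%u/%u\n",
           (devcap2 >> 7) & 1, (devcap2 >> 8) & 1, (devcap2 >> 9) & 1);
    printf("AtomicOp Routing (forwarding)    : %u\n", (devcap2 >> 6) & 1);
    printf("AtomicOp Requester Enable        : %u\n", (devctl2 >> 6) & 1);
    return 0;
}
```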
For a while now, Intel CPUs have supported the COMPLETER role only. They have not supported the FORWARDING or REQUESTER capabilities, and the corresponding DevCap2 bits in the PCI Express capability structure are not set. (Forwarding has a bit, and it is not set; the Requester function does not have a DevCap2 bit at all.)
Thus most Intel CPUs support DEVICE to HOST Atomics only. They do NOT support DEVICE to DEVICE atomics when the two devices are under different PCIe Root Ports (no FORWARDING support), but they can handle DEVICE to DEVICE when a PCIe switch that supports FORWARDING (most do) is installed under the Root Port with both devices below it, bypassing Intel's lack of support. But given that all of this is optional, IMHO you can really only count on DEVICE to HOST support, not Device to Device, nor Host to Device.
Intel CPUs' support of PCIe ATOMICS in the Requester role has to date been absent; they implement only the Completer role. Because they do not support the Requester role, there is also no corresponding implementation of instructions that would generate PCIe FetchAdd, Swap, or CmpAndSwap TLPs from the "FSB" (e.g. QPI, UPI) out to PCIe via the Root Port and the CPU's bus logic (e.g. the Bus Unit of the CPU).
That said, Intel x86 "legacy atomics" (NOT PCIe ATOMICS) are still supported, but NOT on PCIe. The PCI-SIG (the PCIe standards body), at Intel's and AMD's request, outlawed the support of LOCK on PCIe except in one limited case. LOCK was supported on the predecessor PCI (not Express). LOCK on PCIe is supported only to be carried to PCI devices through a PCIe-to-PCI bridge, and in no other way (per the PCIe standard). So while the old LOCK ADD, LOCK INC, etc. primitives DO WORK with PCI devices, they do NOT work with PCIe devices!
The PCIe ATOMICS are atomic ON THE PCIe "bus" (link topology). The Intel legacy LOCK-based x86 "atomics" are atomic within the Coherency Model (MESI, MOESI, ...) of the CPU, but are NOT atomic on the PCIe bus, and they are not always "really" atomic on the FSB either (in the sense of happening in a single FSB operation). To achieve the Coherency Model atomicity, either the MESI model handles the implied LOCK, or the FSB uses an explicit LOCK. On the PCI bus an explicit LOCK was always used. On PCIe, explicit LOCK was outlawed because of the negative pipeline-stalling characteristics it caused on the FSB.
So for instance:
LOCK ADD instruction (to a normal, cacheable line):
FSB: Read with exclusive ownership acquisition (Read Exclusive).
CPU now owns the line exclusively.
CPU adds the operand to the exclusively owned line.
CPU provides the post-ALU-operation value to a register, or to the cached cache line.
CPU has the line in Exclusive DIRTY state.
Eventually someone else will ask for the LINE (a reader).
The CPU will provide the LINE to the reader, do a writeback in the same operation, and change the state depending on how the line was read.
(If read for sharing, the line is changed to Shared or Owned depending on the writeback conditions and the particular MESI protocol derivative in use.)
(If read Exclusive (from another CPU), then the line becomes owned Exclusive in that CPU, along with the writeback to the memory controller (on Intel; AMD varies here, again, MESI variations).)
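In software, that coherent-memory case is just an ordinary atomic on normal write-back RAM. A minimal sketch using the GCC/Clang __atomic builtins; the compiler emits a LOCK-prefixed instruction, and the coherency protocol, not any bus lock, supplies the atomicity:

```c
/* Coherent-memory case: __atomic_fetch_add on ordinary write-back RAM.
 * On x86 this compiles to LOCK ADD (or LOCK XADD if the old value is used).
 * No bus lock and no PCIe AtomicOp TLP is involved. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t counter = 0;                      /* plain cacheable memory */
    __atomic_fetch_add(&counter, 5, __ATOMIC_SEQ_CST);
    printf("counter = %u\n", counter);         /* prints 5 */
    return 0;
}
```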
If instead the line is an UNBUFFERED/UNCACHED PCI-bus device memory line (in a PCI device, NOT a PCIe device), the same LOCK ADD goes like this:
CPU does a READ EXCLUSIVE of the memory line on the FSB.
Host Bus Bridge does a LOCKED READ on PCIe.
Host Bus Bridge (Root Port) provides the line Exclusive to the CPU.
CPU adds operand of instruction to Memory address.
CPU holds the result in a temporary store / write-combining buffer.
CPU evicts from the WC buffer (as the line is UNBUFFERED/UNCACHED) to the RP.
RP receives the FSB CPU writeback (and takes exclusive ownership of the dirty line).
RP initiates a LOCKED Posted WRITE on the PCIe link.
RP marks the FSB ownership of the line as clean, SHARED state.
RP UNLOCKS the PCIe link.
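The software side of that legacy-PCI case looks the same as before, just aimed at an uncached device mapping. A heavily hedged sketch (Linux; the sysfs path, BAR index, and offset are placeholders, and per the above this is only architecturally legal against a conventional PCI device reached through a PCIe-to-PCI bridge):

```c
/* Sketch only: mmap a legacy PCI device's memory BAR via sysfs (which maps
 * it uncached) and perform a LOCK ADD against it with an atomic builtin.
 * On the wire this becomes the locked read / modify / locked posted write
 * sequence described above, NOT a PCIe AtomicOp TLP.  Against a native
 * PCIe endpoint the locked sequence is illegal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:05:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* Compiles to a LOCK ADD aimed at uncached device memory. */
    __atomic_fetch_add((uint32_t *)&bar[0], 1, __ATOMIC_SEQ_CST);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}
```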
On PCIe devices, the old LOCK-based primitives are NOT supported. The PCIe spec makes that explicitly illegal.
The CMPXCHG is similar, with a READ and then a WRITE (the compare occurring in the CPU), and the XCHG instruction works the same way, utilizing LOCK on the PCIe bus. (But no bus LOCK is used for XCHG or CMPXCHG on lines that are in the Coherency Domain, e.g. WRITEBACK or WRITETHRU memory types, hosted in the FSB memory controller domain rather than in PCIe space.)
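The same builtins cover the compare-and-swap and exchange flavors on coherent memory. A minimal sketch:

```c
/* Coherent-memory compare-and-swap / exchange: on x86 these compile to
 * LOCK CMPXCHG and XCHG (XCHG with a memory operand is implicitly locked).
 * Again, the atomicity comes from the coherency protocol, not a bus lock. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t v = 10;

    uint32_t expected = 10;
    /* CAS: if v == expected, store 42; mirrors CMPXCHG semantics. */
    __atomic_compare_exchange_n(&v, &expected, 42, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

    /* Unconditional swap: mirrors XCHG semantics. */
    uint32_t old = __atomic_exchange_n(&v, 7, __ATOMIC_SEQ_CST);

    printf("v = %u, old = %u\n", v, old);   /* v = 7, old = 42 */
    return 0;
}
```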
So the old x86 atomics "work" on the PCI bus (and do NOT work on PCIe links), but they DO NOT produce the PCIe Atomics TLPs. It would be possible for Intel to implement PCIe Atomic REQUESTER capability logic and either change the LOCK ADD, CMPXCHG, and XCHG instruction implementations (unlikely, due to not wanting to disrupt subtle legacy semantics), or add NEW instructions that map directly to PCIe FetchAdd, Swap, and CompareAndSwap. But to my knowledge Intel has not done this yet, and they probably never will, because eventually CXL will take over this space: CXL simply extends the Coherency Model down onto the CXL bus (PCIe-like...), so they would not have to change their existing implementation or semantics. (That seems the more likely path for Intel.)
AMD:
It appears that AMD does support both the Requester and Completer roles on some CPUs. You can determine whether the Completer role is supported by looking at DevCap2 in the PCI Express capability structure of a Root Port; the same goes for the Forwarding capability. There is, for some reason (ask the SIG), no capability bit reporting support of the Requester function. There is, however, a control bit for the Requester function, but it cannot be used as a proxy capability bit: if it reads 0 and will not accept a 1, the CPU does not have Requester capability; if it accepts a write of 1, the function is enabled IF the CPU has it, but acceptance of the write does not imply that the CPU actually has the Requester function (per the PCIe spec).
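Here is what that write-and-read-back probe looks like in practice, as a follow-on to the config-space sketch earlier. In this sketch, fd and cap are assumed to be the open sysfs config file and the PCI Express capability offset found by that earlier code; it needs root and assumes a little-endian host:

```c
/* Probe whether the AtomicOp Requester Enable bit in DevCtl2 will stick.
 * Not sticking => the port definitely lacks the Requester function.
 * Sticking    => the port only MIGHT have it (per the PCIe spec there is
 * no Requester capability bit to check).  DevCtl2 sits at cap + 0x28. */
#include <stdint.h>
#include <unistd.h>

static int probe_requester_enable(int fd, int cap)
{
    uint16_t orig = 0, readback = 0;
    pread(fd, &orig, 2, cap + 0x28);

    uint16_t set = orig | 0x0040;            /* AtomicOp Requester Enable */
    pwrite(fd, &set, 2, cap + 0x28);
    pread(fd, &readback, 2, cap + 0x28);

    pwrite(fd, &orig, 2, cap + 0x28);        /* restore original setting  */
    return (readback & 0x0040) != 0;         /* 1 = bit sticks (MIGHT)    */
}
```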
Bottom line: you can use x86 legacy LOCK operations only on legacy PCI bus devices, NOT on PCIe devices. You can use PCIe atomics on PCIe devices, but only in Device to Host Memory operations on most CPUs.
For CPU to Device usage of PCIe Atomics, most Intel CPUs do not support this, as they lack the Atomics Requester role. But most AMD CPUs DO support the Atomics Requester role. Neither vendor is hard and fast across the board, except for support of Device to Host. (Host to Device is the dodgy part.)
Update: There are some signs that the Intel IceLake CPUs (MIGHT) support the Requester role, maybe. They have a writeable Requester Enable bit, which just means they MIGHT support the Requester role, whereas a non-writeable Requester Enable bit would mean the CPU definitely does NOT have the functionality. If IceLake has such support, I have yet to find any documentation on what compiler primitives (or raw machine/assembly-language instructions) can be used to generate the PCIe Atomics primitives. They cannot be the old LOCK ADD, LOCK INC, LOCK XCHG, or CMPXCHG, because those instructions have semantics for PCI bus (not PCIe) usage that must be maintained. If there are new instructions, I have not found them yet.
AMD's ROCm library support and the __atomic_store_n() routine appear to allow one to generate PCIe Atomic Ops on AMD processors. It would be interesting to see whether that library can be used on an Intel IceLake and also generate a CPU to Device PCIe atomic operation. As of this update, I still do not know the answer to that last question.
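For illustration only, this is the general shape such code takes with the __atomic builtins. The dev_mem pointer is a placeholder for device-visible memory that ROCm (or your driver) has mapped for you, and whether these builtins actually turn into PCIe AtomicOp TLPs depends entirely on the platform implementing the Requester role (and the device the Completer role):

```c
/* Sketch: host-side atomics aimed at device-visible memory.  How dev_mem
 * is obtained is not shown (it is assumed to come from the runtime/driver).
 * On a platform without Requester support these do NOT become PCIe
 * AtomicOp TLPs. */
#include <stdint.h>

void poke_device_atomics(uint32_t *dev_mem)
{
    /* The routine named above: an atomic 32-bit store. */
    __atomic_store_n(dev_mem, 0x1234u, __ATOMIC_RELEASE);

    /* The FetchAdd-shaped builtin, for comparison. */
    (void)__atomic_fetch_add(dev_mem, 1u, __ATOMIC_SEQ_CST);
}
```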
This link might be of interest:
https://github.com/RadeonOpenCompute/ROCm/commit/23beff10b8916c5302ff0df6750c3585e01ea517
It talks about ROCm support (for Radeon) on both AMD and some Intel CPUs. Not sure yet what that implies, instruction-support-wise.