Essentially yes, but there are a lot of caveats, some with severe performance penalties that make atomics usage questionable in many cases (though still useful where other methods of synchronization would cost even more).
The PCIe standard describes essentially three scenarios in which a host CPU/Root Port is involved (though it does not really discuss the scenarios, just the groupings), so I will fill in the blanks a bit and color between the lines here, and compare/contrast that with the x86 Intel instruction set support. (I do not know what AMD's current situation is, but I would GUESS it to be similar; AMD may have done more or less.)
PCIe: The Atomic Operations defined by the PCIe standard indirectly define three cases:
- CPU to Device Atomics. (The memory is in the Device, accessed by the CPU.)
- Device to CPU/Memory Atomics. (The memory is in the Host/CPU, accessed by the Device.)
- Device to Device Atomics. (The memory is in one Device, accessed by another.)
Those three cases are treated as "obvious" by the standard, but they are not really so obvious. The PCIe standard then defines three subsets of support for Root Ports or Endpoints:
a. Requester Support. (To initiate Atomic Operations).
b. Completer Support. (To support reception and carrying out of Requested PCIe Atomics)
c. Forwarding Support. (To support forwarding Atomics from one Root Port to a different Root Port, in the role of handling Device to Device Atomics where the two devices are not under the same Root Port. This also applies to forwarding in PCIe Switches, which must forward upstream or downstream even within a single Root Port's tree when a PCIe switch sits between the two devices.)
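For concreteness, here is where those three subsets show up in configuration space (a sketch: the bit positions come from the PCIe spec's Device Capabilities 2 / Device Control 2 registers in the PCI Express Capability structure, and the names mirror Linux's <linux/pci_regs.h>):

```c
/* AtomicOp-related bits, per the PCIe spec.  Names mirror <linux/pci_regs.h>. */

/* Device Capabilities 2 register */
#define PCI_EXP_DEVCAP2_ATOMIC_ROUTE    0x00000040 /* (c) AtomicOp routing (forwarding) supported */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP32   0x00000080 /* (b) 32-bit AtomicOp Completer supported     */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP64   0x00000100 /* (b) 64-bit AtomicOp Completer supported     */
#define PCI_EXP_DEVCAP2_ATOMIC_COMP128  0x00000200 /* (b) 128-bit CAS Completer supported         */

/* Device Control 2 register */
#define PCI_EXP_DEVCTL2_ATOMIC_REQ      0x0040     /* (a) AtomicOp Requester ENABLE: a control bit
                                                      only; there is no Requester capability bit  */
```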
TO ACCOMPLISH:
Device to HOST operations:
- The Device must support the REQUESTER Capability for Atomics (and the specific ops).
- The Host must support the COMPLETER Capability for Atomics (and the ops).
- Any PCIe switches involved must support the FORWARDING Capability. (Some switches forward without advertising the Forwarding capability, but then you cannot tell programmatically whether the switch really works.)
HOST to Device operations (The OP's question.)
- The HOST must support REQUESTER Capability for Atomics (and be so enabled)
- The Device must support COMPLETER Capability for Atomics (and be so enabled)
- Any switches should report the FORWARDING Capability, but some do forward without the Capability being advertised.
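To check that chain on Linux, you can read DevCap2/DevCtl2 of the endpoint and of its Root Port straight out of sysfs config space. A minimal sketch (run as root; the BDF path is only an example, and a little-endian host is assumed):

```c
/* Walk a function's capability list, find the PCI Express capability, and
 * dump the AtomicOp bits from DevCap2/DevCtl2.  Run it once against the
 * endpoint (Completer support) and once against its Root Port (Routing and
 * the Requester Enable control bit).  Register offsets are from the PCIe
 * spec; little-endian host assumed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint32_t cfg_read(int fd, off_t off, int len)
{
    uint32_t v = 0;
    if (pread(fd, &v, len, off) != len) { perror("pread"); exit(1); }
    return v;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1]
                                : "/sys/bus/pci/devices/0000:01:00.0/config";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return 1; }

    /* Standard capability list: pointer at 0x34, PCI Express cap ID is 0x10. */
    uint8_t ptr = cfg_read(fd, 0x34, 1) & 0xFC;
    while (ptr) {
        if ((cfg_read(fd, ptr, 1) & 0xFF) == 0x10)
            break;
        ptr = cfg_read(fd, ptr + 1, 1) & 0xFC;
    }
    if (!ptr) { fprintf(stderr, "no PCI Express capability\n"); return 1; }

    uint32_t devcap2 = cfg_read(fd, ptr + 0x24, 4);   /* Device Capabilities 2 */
    uint16_t devctl2 = cfg_read(fd, ptr + 0x28, 2);   /* Device Control 2      */

    printf("AtomicOp Completer 32/64/128-bit : %u/%u/%u\n",
           (devcap2 >> 7) & 1, (devcap2 >> 8) & 1, (devcap2 >> 9) & 1);
    printf("AtomicOp Routing (forwarding)    : %u\n", (devcap2 >> 6) & 1);
    printf("AtomicOp Requester Enable        : %u\n", (devctl2 >> 6) & 1);
    return 0;
}
```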
For a while now, Intel CPUs have supported the COMPLETER role only. They have not supported the FORWARDING or REQUESTER capabilities, and the corresponding DevCap2 bits in the PCI Express capability structure are not set. (Forwarding has a bit, and it is not set; the Requester function does not have a DevCap2 bit at all.)
Thus most Intel CPUs support DEVICE to HOST Atomics only. They do NOT support DEVICE to DEVICE atomics when the two devices are under different PCIe Root Ports (no FORWARDING support), but they can handle DEVICE to DEVICE when a PCIe switch that supports FORWARDING (most do) is installed under the Root Port with both devices below it, bypassing Intel's lack of support. But given that all of this is optional, IMHO you can really only count on DEVICE to HOST support, not Device to Device, nor Host to Device.
Intel CPUs' support of PCIe ATOMICS in the Requester role has to date been absent; they implement only the Completer role. Because they do not support the Requester role, there is also no corresponding implementation of instructions that would generate PCIe FetchAdd, Swap, or CmpAndSwap TLPs from the "FSB" (e.g. QPI, UPI) out to PCIe via the Root Port and the CPU's bus logic (e.g. the Bus Unit of the CPU).
That said, Intel x86 "legacy atomics" (NOT PCIe ATOMICS) are still supported, but NOT on PCIe. The PCI-SIG (the PCIe standards body), at Intel's and AMD's request, outlawed the support of LOCK on PCIe except in one limited case. LOCK was supported on the predecessor PCI (not Express). LOCK on PCIe is supported only to be carried to PCI devices through a PCIe-to-PCI bridge, and in no other way (per the PCIe standard). So while the old LOCK ADD, LOCK INC, etc. primitives DO WORK with PCI devices, they do NOT work with PCIe devices!
The PCIe ATOMICS are atomic ON THE PCIe "bus" (link topology). The Intel legacy LOCK-based x86 "atomics" are atomic within the Coherency Model (MESI, MOESI, ...) of the CPU, but are NOT atomic on the PCIe bus, and they are not always "really" atomic on the FSB either (in the sense of happening in a single FSB operation). To achieve the Coherency Model atomicity, either the MESI model handles the implied LOCK, or the FSB uses an explicit LOCK. On the PCI bus an explicit LOCK was always used. On PCIe, explicit LOCK was outlawed because of the negative pipeline-stalling characteristics it caused on the FSB.
So for instance:
LOCK ADD instruction (to a normal, cacheable line):
FSB: Read with exclusive ownership acquisition (Read Exclusive).
CPU now owns the line exclusively.
CPU adds the operand to the exclusively owned line.
CPU provides the post-ALU-operation value to a register, or to the cached cache line.
CPU has the line in Exclusive DIRTY state.
Eventually someone else will ask for the LINE (a reader).
The CPU will provide the LINE to the reader, do a writeback in the same operation, and change the state depending on how the line was read.
(If read for sharing, the line is changed to Shared or Owned depending on the writeback conditions and the particular MESI protocol derivative in use.)
(If read Exclusive (from another CPU), then the line becomes owned Exclusive in that CPU, along with the writeback to the memory controller (on Intel; AMD varies here, again, MESI variations).)
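In software, that coherent-memory case is just an ordinary atomic on normal write-back RAM. A minimal sketch using the GCC/Clang __atomic builtins; the compiler emits a LOCK-prefixed instruction, and the coherency protocol, not any bus lock, supplies the atomicity:

```c
/* Coherent-memory case: __atomic_fetch_add on ordinary write-back RAM.
 * On x86 this compiles to LOCK ADD (or LOCK XADD if the old value is used).
 * No bus lock and no PCIe AtomicOp TLP is involved. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t counter = 0;                      /* plain cacheable memory */
    __atomic_fetch_add(&counter, 5, __ATOMIC_SEQ_CST);
    printf("counter = %u\n", counter);         /* prints 5 */
    return 0;
}
```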
If instead the line is an UNBUFFERED/UNCACHED PCI-bus device memory line (in a PCI device, NOT a PCIe device), the same LOCK ADD goes like this:
CPU does a READ EXCLUSIVE of the memory line on the FSB.
Host Bus Bridge does a LOCKED READ on PCIe.
Host Bus Bridge (Root Port) provides the line Exclusive to the CPU.
CPU adds operand of instruction to Memory address.
CPU holds the result in a temporary store / write-combining buffer.
CPU evicts from the WC buffer (as the line is UNBUFFERED/UNCACHED) to the RP.
RP receives the FSB CPU writeback (and takes exclusive ownership of the dirty line).
RP initiates a LOCKED Posted WRITE on the PCIe link.
RP marks the FSB ownership of the line as clean, SHARED state.
RP UNLOCKS the PCIe link.
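The software side of that legacy-PCI case looks the same as before, just aimed at an uncached device mapping. A heavily hedged sketch (Linux; the sysfs path, BAR index, and offset are placeholders, and per the above this is only architecturally legal against a conventional PCI device reached through a PCIe-to-PCI bridge):

```c
/* Sketch only: mmap a legacy PCI device's memory BAR via sysfs (which maps
 * it uncached) and perform a LOCK ADD against it with an atomic builtin.
 * On the wire this becomes the locked read / modify / locked posted write
 * sequence described above, NOT a PCIe AtomicOp TLP.  Against a native
 * PCIe endpoint the locked sequence is illegal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:05:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* Compiles to a LOCK ADD aimed at uncached device memory. */
    __atomic_fetch_add((uint32_t *)&bar[0], 1, __ATOMIC_SEQ_CST);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}
```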
On PCIe devices, the old LOCK-based primitives are NOT supported. The PCIe spec makes that explicitly illegal.
The CMPXCHG is similar, with a READ and then a WRITE (the compare occurring in the CPU), and the XCHG instruction works the same way, utilizing LOCK on the PCIe bus. (But no bus LOCK is used for XCHG or CMPXCHG on lines that are in the Coherency Domain, e.g. WRITEBACK or WRITETHRU memory types, hosted in the FSB memory controller domain rather than in PCIe space.)
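The same builtins cover the compare-and-swap and exchange flavors on coherent memory. A minimal sketch:

```c
/* Coherent-memory compare-and-swap / exchange: on x86 these compile to
 * LOCK CMPXCHG and XCHG (XCHG with a memory operand is implicitly locked).
 * Again, the atomicity comes from the coherency protocol, not a bus lock. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t v = 10;

    uint32_t expected = 10;
    /* CAS: if v == expected, store 42; mirrors CMPXCHG semantics. */
    __atomic_compare_exchange_n(&v, &expected, 42, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

    /* Unconditional swap: mirrors XCHG semantics. */
    uint32_t old = __atomic_exchange_n(&v, 7, __ATOMIC_SEQ_CST);

    printf("v = %u, old = %u\n", v, old);   /* v = 7, old = 42 */
    return 0;
}
```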
So the old x86 atomics "work" on the PCI bus (and do NOT work on PCIe links), but they DO NOT produce the PCIe Atomics TLPs. It would be possible for Intel to implement PCIe Atomic REQUESTER capability logic and either change the LOCK ADD, CMPXCHG, and XCHG instruction implementations (unlikely, due to not wanting to disrupt subtle legacy semantics), or add NEW instructions that map directly to PCIe FetchAdd, Swap, and CompareAndSwap. But to my knowledge Intel has not done this yet, and they probably never will, because eventually CXL will take over this space: CXL simply extends the Coherency Model down onto the CXL bus (PCIe-like...), so they would not have to change their existing implementation or semantics. (That seems the more likely path for Intel.)
AMD:
It appears that AMD does support both the Requester and Completer roles on some CPUs. You can determine whether the Completer role is supported by looking at DevCap2 in the PCI Express capability structure of a Root Port; the same goes for the Forwarding capability. There is, for some reason (ask the SIG), no capability bit reporting support of the Requester function. There is, however, a control bit for the Requester function, but it cannot be used as a proxy capability bit: if it reads 0 and will not accept a 1, the CPU does not have Requester capability; if it accepts a write of 1, the function is enabled IF the CPU has it, but acceptance of the write does not imply that the CPU actually has the Requester function (per the PCIe spec).
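Here is what that write-and-read-back probe looks like in practice, as a follow-on to the config-space sketch earlier. In this sketch, fd and cap are assumed to be the open sysfs config file and the PCI Express capability offset found by that earlier code; it needs root and assumes a little-endian host:

```c
/* Probe whether the AtomicOp Requester Enable bit in DevCtl2 will stick.
 * Not sticking => the port definitely lacks the Requester function.
 * Sticking    => the port only MIGHT have it (per the PCIe spec there is
 * no Requester capability bit to check).  DevCtl2 sits at cap + 0x28. */
#include <stdint.h>
#include <unistd.h>

static int probe_requester_enable(int fd, int cap)
{
    uint16_t orig = 0, readback = 0;
    pread(fd, &orig, 2, cap + 0x28);

    uint16_t set = orig | 0x0040;            /* AtomicOp Requester Enable */
    pwrite(fd, &set, 2, cap + 0x28);
    pread(fd, &readback, 2, cap + 0x28);

    pwrite(fd, &orig, 2, cap + 0x28);        /* restore original setting  */
    return (readback & 0x0040) != 0;         /* 1 = bit sticks (MIGHT)    */
}
```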
Bottom line: you can use x86 legacy LOCK operations only on legacy PCI bus devices, NOT on PCIe devices. You can use PCIe atomics on PCIe devices, but only in Device to Host Memory operations on most CPUs.
For CPU to Device usage of PCIe Atomics, most Intel CPUs do not support this, as they lack the Atomics Requester role. But most AMD CPUs DO support the Atomics Requester role. Neither vendor is hard and fast across the board, except for support of Device to Host. (Host to Device is the dodgy part.)
Update: There are some signs that the Intel IceLake CPUs (MIGHT) support the Requester role, maybe. They have a writeable Requester Enable bit, which just means they MIGHT support the Requester role, whereas a non-writeable Requester Enable bit would mean the CPU definitely does NOT have the functionality. If IceLake has such support, I have yet to find any documentation on what compiler primitives (or raw machine/assembly-language instructions) can be used to generate the PCIe Atomics primitives. They cannot be the old LOCK ADD, LOCK INC, LOCK XCHG, or CMPXCHG, because those instructions have semantics for PCI bus (not PCIe) usage that must be maintained. If there are new instructions, I have not found them yet.
AMD's ROCm library support and the __atomic_store_n() routine appear to allow one to generate PCIe Atomic Ops on AMD processors. It would be interesting to see whether that library can be used on an Intel IceLake and also generate a CPU to Device PCIe atomic operation. As of this update, I still do not know the answer to that last question.
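For illustration only, this is the general shape such code takes with the __atomic builtins. The dev_mem pointer is a placeholder for device-visible memory that ROCm (or your driver) has mapped for you, and whether these builtins actually turn into PCIe AtomicOp TLPs depends entirely on the platform implementing the Requester role (and the device the Completer role):

```c
/* Sketch: host-side atomics aimed at device-visible memory.  How dev_mem
 * is obtained is not shown (it is assumed to come from the runtime/driver).
 * On a platform without Requester support these do NOT become PCIe
 * AtomicOp TLPs. */
#include <stdint.h>

void poke_device_atomics(uint32_t *dev_mem)
{
    /* The routine named above: an atomic 32-bit store. */
    __atomic_store_n(dev_mem, 0x1234u, __ATOMIC_RELEASE);

    /* The FetchAdd-shaped builtin, for comparison. */
    (void)__atomic_fetch_add(dev_mem, 1u, __ATOMIC_SEQ_CST);
}
```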
This link might be of interest:
https://github.com/RadeonOpenCompute/ROCm/commit/23beff10b8916c5302ff0df6750c3585e01ea517
It talks about ROCm support (for Radeon) on both AMD and some Intel CPUs. Not sure yet what that implies, instruction-support-wise.