No, except for a special case.
This can't be performed atomically in the general case, where a, b, c, and d are arbitrary (i.e. not necessarily adjacent) and/or x, y, z, and w are each 32 bits or larger.
I'm using "atomically" to refer to an atomic RMW operation that the hardware provides.
Such operations are limited to a maximum of 64 bits total, so four quantities of 32 bits or more each cannot fit. Furthermore, all the data must be contiguous and "naturally" aligned, so independent locations cannot be accessed in a single atomic operation.
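In the special case where the 4 quantities are 16-bit or 8-bit quantities, and are adjacent and aligned, you could use a custom atomic, for example one built from an atomicCAS loop on the 64-bit (or 32-bit) word that contains all four.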
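As a rough illustration of that special case, here is a minimal sketch of a custom atomic for four adjacent 16-bit values packed into one aligned 64-bit word. The question's actual operation isn't shown here, so I'm assuming a lane-wise add; the names add4x16, atomicAdd4x16 and demo are made up for this example. Only atomicCAS on unsigned long long is a real CUDA intrinsic.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: lane-wise add of four 16-bit values into a packed
// 64-bit word. Each lane wraps modulo 2^16; no carry crosses into the next lane.
__host__ __device__ inline unsigned long long
add4x16(unsigned long long packed, uint16_t x, uint16_t y, uint16_t z, uint16_t w)
{
    uint16_t l0 = (uint16_t)((packed      ) + x);   // bits  0..15
    uint16_t l1 = (uint16_t)((packed >> 16) + y);   // bits 16..31
    uint16_t l2 = (uint16_t)((packed >> 32) + z);   // bits 32..47
    uint16_t l3 = (uint16_t)((packed >> 48) + w);   // bits 48..63
    return  (unsigned long long)l0        |
           ((unsigned long long)l1 << 16) |
           ((unsigned long long)l2 << 32) |
           ((unsigned long long)l3 << 48);
}

// Custom atomic built from a compare-and-swap loop: retry until no other
// thread modified the word between our read and our write.
__device__ unsigned long long
atomicAdd4x16(unsigned long long *addr, uint16_t x, uint16_t y, uint16_t z, uint16_t w)
{
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed, add4x16(assumed, x, y, z, w));
    } while (old != assumed);
    return old;  // value of the word before our update
}

// Example kernel (assumption): every thread adds (1, 2, 3, 4) to the four lanes.
__global__ void demo(unsigned long long *packed)
{
    atomicAdd4x16(packed, 1, 2, 3, 4);
}
```

The word passed to the kernel has to be 8-byte aligned (anything from cudaMalloc is). Note that under contention the CAS loop retries, so this can be noticeably slower than a native atomic.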
Alternatives to consider:
You can use a critical section (i.e. a lock) to achieve such things, probably at a considerable cost in performance, code complexity, and fragility.
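To give an idea of what a critical section might look like, here is a minimal sketch using a global spinlock. It assumes the update is "add x, y, z, w to data[a], data[b], data[c], data[d]", which is my guess at the operation; the kernel name, the helper runLockedUpdate, and the indices used in it are all hypothetical. Only one thread per block takes the lock, since having several threads of the same warp spin on one lock is a common way to hang the kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical update: apply four additions as one indivisible group,
// guarded by a global lock (0 = free, 1 = held).
__global__ void lockedUpdate(int *lock, volatile float *data,
                             int a, int b, int c, int d,
                             float x, float y, float z, float w)
{
    if (threadIdx.x == 0) {                        // one contender per block
        while (atomicCAS(lock, 0, 1) != 0) { }     // spin until acquired
        __threadfence();                           // observe the previous holder's writes
        data[a] = data[a] + x;                     // volatile: go to memory every time
        data[b] = data[b] + y;
        data[c] = data[c] + z;
        data[d] = data[d] + w;
        __threadfence();                           // publish our writes before release
        atomicExch(lock, 0);                       // release
    }
}

// Host-side helper (assumption): runs numBlocks blocks and returns data[0].
float runLockedUpdate(int numBlocks)
{
    int *lock;
    float *data;
    float result[8];
    cudaMalloc(&lock, sizeof(int));
    cudaMalloc(&data, sizeof(result));
    cudaMemset(lock, 0, sizeof(int));
    cudaMemset(data, 0, sizeof(result));
    lockedUpdate<<<numBlocks, 32>>>(lock, data, 0, 2, 5, 7, 1.0f, 2.0f, 3.0f, 4.0f);
    cudaMemcpy(result, data, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(lock);
    cudaFree(data);
    return result[0];
}
```

Every block serializes on the same lock here, which is where the performance cost comes from. Error checking on the CUDA calls is left out for brevity.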
Another alternative is to recast your algorithm to use some form of parallel reduction. Since you appear to be operating at the threadblock level, this may be the best approach.
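For the reduction route, here is a minimal sketch of a block-level shared-memory reduction. It assumes every thread in the block contributes one (x, y, z, w) tuple destined for the same four outputs, and that the block size is a power of two; BLOCK, blockReduce4 and runBlockReduce4 are names I made up for the example. Whether this maps onto your algorithm depends on details not shown in the question.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // assumed block size (must be a power of two)

// Each block sums its threads' float4 contributions component-wise and
// writes a single float4 result, so no atomics are needed inside the block.
__global__ void blockReduce4(const float4 *contrib, float4 *out)
{
    __shared__ float4 s[BLOCK];
    int t = threadIdx.x;
    s[t] = contrib[blockIdx.x * BLOCK + t];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            s[t].x += s[t + stride].x;
            s[t].y += s[t + stride].y;
            s[t].z += s[t + stride].z;
            s[t].w += s[t + stride].w;
        }
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];  // one writer per block
}

// Host-side helper (assumption): one block, every thread contributes (1, 2, 3, 4).
float4 runBlockReduce4()
{
    std::vector<float4> h(BLOCK, make_float4(1.0f, 2.0f, 3.0f, 4.0f));
    float4 *dIn, *dOut, result;
    cudaMalloc(&dIn, BLOCK * sizeof(float4));
    cudaMalloc(&dOut, sizeof(float4));
    cudaMemcpy(dIn, h.data(), BLOCK * sizeof(float4), cudaMemcpyHostToDevice);
    blockReduce4<<<1, BLOCK>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because a single thread per block performs the final write, the four outputs are updated together from the block's point of view. If several blocks target the same outputs, you would still need a second reduction pass (or one of the approaches above) to combine the per-block results.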