Well, almost any CUDA intrinsic can be replaced, so I'll interpret your question as: can `__shfl_xor` be replaced cheaply on SM_21 GPUs?
And the answer is: not really; you'll incur a penalty. Your best option, as @RobertCrovella's comment suggests, is to use shared memory:
- Each lane writes its data into a location in shared memory (make these consecutive 4-byte-sized values to avoid bank conflicts)
- Perform some kind of synchronization (you'll probably have to call `__syncthreads()`)
- Each lane reads from the shared-memory location written by the lane whose value it wants.
I didn't spell out the code so as not to take the fun away from you :-)
edit: While the execution of shuffle is more complex, it is still, semantically at least, an operation on registers; and it does not require synchronization. So the shared-memory alternative would be slower.
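If you'd rather have something to check your own attempt against, here is a minimal sketch of the three steps listed earlier. Everything named here is hypothetical and mine, not part of any CUDA API: the helper `shfl_xor_smem`, the kernel `demo`, the host function `run_check`, and the constants `WARP_SIZE` and `BLOCK_SIZE`. It assumes a one-dimensional block of exactly `BLOCK_SIZE` threads (a multiple of the warp size), a lane mask below 32, and that every thread of the block calls the helper, since it contains `__syncthreads()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE  32
#define BLOCK_SIZE 64   // hypothetical block size; must be a multiple of WARP_SIZE

// Hypothetical stand-in for __shfl_xor(value, laneMask) on hardware without shuffle.
// All threads of the block must call it, because it uses __syncthreads().
__device__ int shfl_xor_smem(int value, int laneMask, int *scratch)
{
    const int tid      = threadIdx.x;
    const int lane     = tid % WARP_SIZE;
    const int warpBase = tid - lane;

    scratch[tid] = value;                                // 1. consecutive 4-byte slots
    __syncthreads();                                     // 2. make the writes visible
    int result = scratch[warpBase + (lane ^ laneMask)];  // 3. read the partner lane's slot
    __syncthreads();                                     // allow scratch to be reused afterwards
    return result;
}

__global__ void demo(const int *in, int *out, int laneMask)
{
    __shared__ int scratch[BLOCK_SIZE];
    const int tid = threadIdx.x;
    out[tid] = shfl_xor_smem(in[tid], laneMask, scratch);
}

// Host-side check: with laneMask = 1, thread i should receive the value of thread i ^ 1.
bool run_check()
{
    int h_in[BLOCK_SIZE], h_out[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h_in[i] = 100 + i;

    int *d_in = 0, *d_out = 0;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    demo<<<1, BLOCK_SIZE>>>(d_in, d_out, 1);
    bool ok = cudaMemcpy(d_out, h_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < BLOCK_SIZE; ++i) ok = (h_out[i] == h_in[i ^ 1]);
    return ok;
}

int main()
{
    const bool ok = run_check();
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

The second `__syncthreads()` is there only so the scratch buffer can be overwritten safely by a later call; together with the first one, it is a large part of the penalty mentioned above.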