
Trying to run this: https://github.com/Celebrandil/CudaSift on an NVS4200M, which is sm_21, not sm_35 as required. The only problem in running the mentioned project is this code (cudaSiftD.cu:205):

for (int i=1;i<=16;i*=2) sum += __shfl_xor(sum, i);

Is there equivalent code that works on sm_21?

  • Yes there is, if you care to write it. – talonmies Apr 06 '17 at 15:42
  • Pretty much anything you can do with shuffle operations can be done with shared memory operations, which also permits inter-thread communication. I'm not suggesting the implementation is identical, just that there is a "possible equivalent code" using shared memory. – Robert Crovella Apr 06 '17 at 18:54
  • @talonmies how can this comment help the OP? This is a non-trivial question, as I don't consider shuffle intrinsics a simple feature of CUDA. – Regis Portalez Apr 07 '17 at 06:47

2 Answers


Well, almost any CUDA intrinsic can be replaced, so I'll interpret your question as

Can __shfl_xor be replaced cheaply on SM_21 GPUs?

And the answer is: not really; you'll incur a penalty. Your best option, as @RobertCrovella's comment suggests, is to use shared memory:

  • Each lane writes its value into a location in shared memory (use consecutive 4-byte slots to avoid bank conflicts)
  • Perform some kind of synchronization (probably you'll have to call __syncthreads())
  • Each lane reads from the slot written by the lane whose value it wants (for __shfl_xor(sum, i), that is the lane with index threadIdx.x ^ i).

I didn't spell out a complete solution, so as not to take all the fun away from you :-) but a minimal sketch follows.
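This sketch assumes sum is a float, the block size is a multiple of the warp size, and every thread of the block reaches this call together; the names BlockXorSum and scratch are illustrative, not from CudaSift:

    // Shared-memory stand-in for the __shfl_xor butterfly reduction.
    // scratch must point to blockDim.x consecutive floats in shared memory,
    // and all threads of the block must call this function together.
    __device__ float BlockXorSum(float sum, float *scratch)
    {
        int tid = threadIdx.x;
        for (int i = 1; i <= 16; i *= 2) {
            scratch[tid] = sum;        // publish this lane's value
            __syncthreads();           // make every lane's write visible
            sum += scratch[tid ^ i];   // read the XOR-partner's slot
            __syncthreads();           // keep the next round's writes from
                                       // clobbering slots still being read
        }
        return sum;
    }

The two __syncthreads() calls per round are exactly the cost that __shfl_xor avoids.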

edit: While the execution of a shuffle is more complex than a plain register operation, it is still, semantically at least, an operation on registers, and it does not require synchronization. So the shared-memory alternative will be slower.

einpoklum
  • I would not consider shuffle to be one clock cycle, for two reasons: 1) there are 32 issuable shuffles per cycle on a multiprocessor [http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions], and 2) the shuffle operation is performed by the cache that manages shared memory. In essence, using shuffle gives about twice the performance of shared memory - see [http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf] – Florent DUGUET Apr 07 '17 at 09:35
  • @FlorentDUGUET: Edited to reflect your comment. Your link doesn't work though; I think there's a typo with the parentheses. – einpoklum Apr 07 '17 at 10:47
  • Broken links in the above comment: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions and http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf – Florent DUGUET Apr 07 '17 at 11:00
  • @FlorentDUGUET: Comment +1 for pointing out that shuffle execution involves the mechanism for accessing shared memory. – einpoklum Apr 07 '17 at 11:03
  • Yeah, I understand there is a "price" (aka penalty). The thing is - I am not familiar with the CUDA API, so writing code by myself is not an option for me, at least for now. I was just looking for a CUDA-accelerated implementation of SIFT that could be used with Python. I found it, but encountered the problem described in the question. Anyway, thank you for trying to help me! – Alex Fliker Apr 07 '17 at 19:15
  • @AlexFliker: That's a different question. You could ask it. I personally have no clue about SIFT, so I can't help you with that. Also, if you're not proficient enough with CUDA to convert my suggestions into code, I would be wary of even cutting and pasting code someone else has given you. – einpoklum Apr 07 '17 at 19:22
  • @einpoklum, yeah, I should have mentioned that I am not proficient with CUDA. There is no need to know what SIFT is, because there is no need to rewrite any code other than the snippet in the question. Well, it's always a good idea to be wary of copy-and-paste code :-) – Alex Fliker Apr 10 '17 at 12:17

If the question is more about how to replace this snippet with code compatible with sm_21, you may want to take a look at CUB, specifically the block-reduce part here. One of the template parameters is the architecture of your device.
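For illustration, a block-wide sum with cub::BlockReduce might look like the sketch below; the kernel name, block size, and data layout are assumptions for the example, not CudaSift code. CUB selects a shuffle-based or shared-memory implementation to match the target architecture, so the same source builds for sm_21 and sm_35:

    #include <cub/cub.cuh>

    // Hypothetical kernel: each block reduces BLOCK_THREADS floats to one sum.
    template <int BLOCK_THREADS>
    __global__ void BlockSumKernel(const float *in, float *out)
    {
        typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;

        float val = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
        float sum = BlockReduce(temp_storage).Sum(val);  // valid in thread 0 only

        if (threadIdx.x == 0)
            out[blockIdx.x] = sum;
    }

Launched, for instance, as BlockSumKernel<256><<<num_blocks, 256>>>(d_in, d_out);.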

The __CUDA_ARCH__ macro can help you select the most appropriate implementation at compile time, see here.
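A hedged sketch of such a selection, wrapping the original loop and falling back to a shared-memory exchange like the one outlined in the other answer (the function and parameter names here are made up for the example):

    // Compile-time dispatch: __CUDA_ARCH__ is only defined during device
    // compilation and carries the target architecture (e.g. 210 for sm_21).
    __device__ float WarpSum(float sum, float *scratch)
    {
    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 300
        for (int i = 1; i <= 16; i *= 2)
            sum += __shfl_xor(sum, i);     // sm_30+: native warp shuffle
    #else
        int tid = threadIdx.x;             // sm_21: shared-memory exchange
        for (int i = 1; i <= 16; i *= 2) {
            scratch[tid] = sum;
            __syncthreads();
            sum += scratch[tid ^ i];
            __syncthreads();
        }
    #endif
        return sum;
    }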

Florent DUGUET