Can we use `shuffle()` instruction for reg-to-reg data-exchange between items (threads) in WaveFront?

Question

As we know, a WaveFront (AMD OpenCL) is very similar to a WARP (CUDA): http://research.cs.wisc.edu/multifacet/papers/isca14-channels.pdf

GPGPU languages, like OpenCL™ and CUDA, are called SIMT because they map the programmer’s view of a thread to a SIMD lane. Threads executing on the same SIMD unit in lockstep are called a wavefront (warp in CUDA).

It is also known that AMD suggests performing a reduction (summing numbers) through local memory, and suggests using vector types to accelerate the reduction: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/AMD_OpenCL_Tutorial_SAAHPC2010.pdf

But are there any optimized register-to-register data-exchange instructions between items (threads) in a WaveFront:

such as `int __shfl_down(int var, unsigned int delta, int width=warpSize);` for a WARP (CUDA): https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
or such as `__m128i _mm_shuffle_epi8(__m128i a, __m128i b);` for SIMD lanes on x86_64: https://software.intel.com/en-us/node/524215

Such a shuffle instruction can, for example, perform a reduction (sum) of 8 elements from 8 threads/lanes in 3 steps, without any synchronization and without using any cache/local/shared memory (which has ~3 cycles of latency for each access).

That is, each thread sends its value directly to a register of another thread: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
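To make the exchange pattern concrete, here is a minimal host-side sketch in plain C that only simulates it: an array stands in for the registers of 8 lanes. The names `LANES`, `sim_shfl_down` and `sim_wave_reduce_add` are hypothetical and exist only for this illustration; the rule that an out-of-range lane keeps its own value is assumed to mirror CUDA's `__shfl_down`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* hypothetical group width, kept small for illustration */

/* Simulated cross-lane read: lane i receives the value held by lane
   i + delta. Lanes whose source would fall outside the group keep
   their own value (assumption: same rule as CUDA's __shfl_down). */
static void sim_shfl_down(const int in[LANES], int out[LANES], unsigned delta)
{
    for (unsigned lane = 0; lane < LANES; ++lane) {
        unsigned src = lane + delta;
        out[lane] = (src < LANES) ? in[src] : in[lane];
    }
}

/* Tree reduction: after log2(LANES) = 3 exchange steps lane 0 holds
   the sum of all lanes. No shared buffer is used between the steps. */
static int sim_wave_reduce_add(const int values[LANES])
{
    int regs[LANES]; /* one "register" per lane */
    int recv[LANES]; /* value each lane received in this step */

    assert(values != NULL);
    for (unsigned lane = 0; lane < LANES; ++lane)
        regs[lane] = values[lane];

    for (unsigned delta = LANES / 2; delta > 0; delta /= 2) {
        sim_shfl_down(regs, recv, delta);
        for (unsigned lane = 0; lane < LANES; ++lane)
            regs[lane] += recv[lane];
    }
    return regs[0]; /* only lane 0 is guaranteed to hold the full sum */
}
```

On real hardware each loop over `lane` would be a single instruction executed by all lanes at once; the sketch only shows which lane reads from which.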

Or, in OpenCL, can we only use `gentypen shuffle(gentypem x, ugentypen mask)`, which works only on vector types such as float16/uint16 inside a single item (thread), but not between items (threads) in a WaveFront: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html

Can we use something like shuffle() for reg-to-reg data exchange between items (threads) in a WaveFront that is faster than data exchange via local memory?

Does AMD OpenCL have instructions for register-to-register data exchange within a WaveFront, like the instructions __any(), __all(), __ballot(), __shfl() within a WARP (CUDA): http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf

Warp vote functions:

__any(predicate) returns non-zero if any of the predicates for the threads in the warp returns non-zero
__all(predicate) returns non-zero if all of the predicates for the threads in the warp returns non-zero
__ballot(predicate) returns a bit-mask with the respective bits of threads set where predicate returns non-zero
__shfl(value, thread) returns value from the requested thread (but only if this thread also performed a __shfl()-operation)
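The semantics of these four functions can be sketched with a small host-side simulation in plain C, where an array index plays the role of the lane id. Everything here (`WARP`, `sim_ballot`, `sim_any`, `sim_all`, `sim_shfl`) is a hypothetical stand-in for illustration, not a CUDA or OpenCL API, and a warp width of 32 is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define WARP 32 /* assumed warp width for this illustration */

/* Bit i of the result is set when lane i's predicate is non-zero. */
static uint32_t sim_ballot(const int pred[WARP])
{
    uint32_t mask = 0;
    for (unsigned lane = 0; lane < WARP; ++lane)
        if (pred[lane])
            mask |= (uint32_t)1 << lane;
    return mask;
}

/* Non-zero if at least one lane's predicate is non-zero. */
static int sim_any(const int pred[WARP])
{
    return sim_ballot(pred) != 0;
}

/* Non-zero only if every lane's predicate is non-zero. */
static int sim_all(const int pred[WARP])
{
    return sim_ballot(pred) == UINT32_MAX;
}

/* What a lane receives when it asks for the value held by src_lane. */
static int sim_shfl(const int value[WARP], unsigned src_lane)
{
    assert(src_lane < WARP);
    return value[src_lane];
}
```

Expressing `sim_any` and `sim_all` through `sim_ballot` is just a convenience of the sketch; it says nothing about how hardware implements them.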

CONCLUSION:

As is known, OpenCL 2.0 has sub-groups with a SIMD execution model akin to WaveFronts (see the related question "Does the official OpenCL 2.2 standard support the WaveFront?").

For sub-groups there are the following functions (page 160): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf

int sub_group_all(int predicate); the same as CUDA-__all(predicate)
int sub_group_any(int predicate); the same as CUDA-__any(predicate)

But OpenCL has no functions similar to:

CUDA-__ballot(predicate)
CUDA-__shfl(value, thread)

There are only Intel-specific built-in shuffle functions, in Version 4 (August 28, 2016, Final Draft) of OpenCL Extension #35: intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, intel_sub_group_shuffle_xor: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt

OpenCL also has functions that are usually implemented with shuffle functions, but they do not cover everything that can be implemented with shuffle functions:

<gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );
<gentype> sub_group_reduce_<op>( <gentype> x );
<gentype> sub_group_scan_exclusive_<op>( <gentype> x );
<gentype> sub_group_scan_inclusive_<op>( <gentype> x );
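As an illustration of how such a function could be built on top of a shuffle primitive, here is a minimal host-side C simulation of an inclusive add-scan that uses only a shuffle-up exchange. `SG_SIZE`, `sim_shuffle_up` and `sim_scan_inclusive_add` are hypothetical names for this sketch, and whether a particular vendor implements `sub_group_scan_inclusive_add` this way is an assumption, not something the specification states.

```c
#include <assert.h>
#include <stddef.h>

#define SG_SIZE 8 /* hypothetical sub-group size */

/* Simulated shuffle-up: lane i receives the value held by lane
   i - delta; lanes below delta keep their own value. */
static void sim_shuffle_up(const int in[SG_SIZE], int out[SG_SIZE],
                           unsigned delta)
{
    for (unsigned lane = 0; lane < SG_SIZE; ++lane)
        out[lane] = (lane >= delta) ? in[lane - delta] : in[lane];
}

/* Inclusive prefix sum in log2(SG_SIZE) exchange steps:
   result[i] = x[0] + x[1] + ... + x[i]. */
static void sim_scan_inclusive_add(const int x[SG_SIZE], int result[SG_SIZE])
{
    int recv[SG_SIZE];

    assert(x != NULL && result != NULL);
    for (unsigned lane = 0; lane < SG_SIZE; ++lane)
        result[lane] = x[lane];

    for (unsigned delta = 1; delta < SG_SIZE; delta *= 2) {
        sim_shuffle_up(result, recv, delta);
        for (unsigned lane = 0; lane < SG_SIZE; ++lane)
            if (lane >= delta) /* lanes without a source skip the add */
                result[lane] += recv[lane];
    }
}
```

A reduction can be read from the last lane of the scan result, and a broadcast is a shuffle in which every lane names the same source lane, which is why a general shuffle is the more flexible building block.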

Summary:

Shuffle functions remain the more flexible option and ensure the fastest possible communication between threads through direct register-to-register data exchange.
But the functions sub_group_broadcast/_reduce/_scan do not guarantee direct register-to-register data exchange, and these sub-group functions are less flexible.

@huseyin tugrul buyukisik Yes, I need to access a private register of another thread. For example, if I have a small array in local memory with 32-64 elements and I want to sum them, then the fastest way is to use 32-64 threads, but without too many accesses to local memory. — Alex, Feb 15 '17 at 22:56
if array is large, work_group_-reduce-scan commands may help — huseyin tugrul buyukisik, Feb 15 '17 at 23:02
I am not entirely sure, but the `swizzle` operations described here [1] remind me of the nvidia shuffle. Can somebody more knowledgeable comment on that? [1]: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ — mSSM, Mar 16 '17 at 12:29
@mSSM The `swizzle` in the link you referenced is part of **HIP C++**, not part of **OpenCL**. And yes, HIP C++ has instructions such as CUDA-`__shfl()`, because HIP C++ is a modified clone of CUDA: http://stackoverflow.com/a/42562600/1558037 OpenCL has no `swizzle` function, but it has `swizzle` expressions such as `float3 v3 = pv1->s321;`, which operate only on registers **inside 1 thread (SIMD-lane), not between threads (SIMD-lanes)** - part 2.1.2.3 of **Standard OpenCL C++ Page 16 - 26**: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2-cplusplus.pdf — Alex, Mar 16 '17 at 13:34

huseyin tugrul buyukisik · Accepted Answer · 2017-02-19T22:29:31.040

There is

gentype work_group_reduce_<op>(gentype x)

for OpenCL version >= 2.0, but its definition doesn't say anything about using local memory or registers. It just reduces each collaborator's x value to a single result over all of them (a sum, for the add variant). This function must be reached by all work-items in the work-group, so it is not a wavefront-level approach. Also, the order of floating-point operations is not guaranteed.

Maybe some vendors do it through registers while others use local memory. Nvidia does it with registers, I assume. But an old mainstream AMD GPU has a local memory bandwidth of 3.7 TB/s, which is still a good amount. (edit: it's not 22 TB/s) For 2k cores, this means nearly 1.5 bytes per cycle per core, or much more per cache line.

For a 100% register version (as long as it doesn't spill to global memory), you can reduce the number of threads and do a vectorized reduction inside each thread, without communicating with other threads, if the number of elements is just 8 or 16. Such as

v.s0123 += v.s4567;  // 8 partial values -> 4
v.s01 += v.s23;      // 4 -> 2
v.s0 += v.s1;        // 2 -> 1, the total ends up in v.s0

This should be similar to `__m128i _mm_shuffle_epi8` combined with an add when compiled for a CPU, and non-scalar implementations will use the same SIMD unit on a GPU to do these 3 operations.
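The same three folding steps can be checked on the host with a small plain C sketch, where an 8-element array stands in for an OpenCL `float8` held in one work-item's registers. The function name `sim_reduce_in_thread` is hypothetical, and the loops only model the component-wise adds; they are not how a GPU compiler would emit them.

```c
#include <assert.h>
#include <stddef.h>

/* Folds 8 values held by ONE work-item down to a single sum in three
   steps, mirroring the three swizzled additions on a float8. */
static float sim_reduce_in_thread(const float lanes[8])
{
    float v[8];

    assert(lanes != NULL);
    for (int i = 0; i < 8; ++i)
        v[i] = lanes[i];

    for (int i = 0; i < 4; ++i) /* v.s0123 += v.s4567 */
        v[i] += v[i + 4];
    for (int i = 0; i < 2; ++i) /* v.s01 += v.s23 */
        v[i] += v[i + 2];
    v[0] += v[1];               /* v.s0 += v.s1 */

    return v[0];
}
```

No other work-item is involved at any step, which is why this variant needs neither local memory nor a cross-lane shuffle.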

Also, using these vector types tends to produce efficient memory transactions for global and local memory too, not just for registers.

A SIMD unit works on only a single wavefront at a time, but a wavefront may be processed by multiple SIMD units, so this vector operation does not imply that a whole wavefront is being used. The whole wavefront may even be computing the 1st elements of all vectors in one cycle. For a CPU, though, the most logical option is for the SIMD unit to compute work-items one by one (AVX, SSE) instead of computing them in parallel across their same-indexed elements.

If the main work-group doesn't fit one's requirements, child kernels can be spawned to run kernels of dynamic width for this kind of operation. A child kernel works concurrently on another group, called a sub-group. This is done through a device-side queue and requires at least OpenCL 2.0.

Look for "device-side enqueue" in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf

AMD APP SDK supports Sub-Group

Thank you! Can you add to the answer that the AMD APP SDK now supports sub-groups, akin to WaveFronts? There are `sub_group_all(predicate)` and `sub_group_any(predicate)`, which are similar to CUDA-`__all(predicate)` and CUDA-`__any(predicate)`. But there are no functions similar to CUDA-`__ballot(predicate)` and CUDA-`__shfl(value, thread)`. Page 160: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf Then I will accept the answer. — Alex, Feb 19 '17 at 22:01
Actually it is an OpenCL feature, not only an AMD one. If it is a 2.0 device, it has to support it or at least let it be queried, as far as I know. I added it to the end of the answer. Yes, it becomes more like CUDA, but not exactly. "Whereas out-of-order device queues are mandatory and they are supported by any OpenCL 2.0 enabled devices." is said here: https://community.amd.com/thread/170319 — huseyin tugrul buyukisik, Feb 19 '17 at 22:24
