4

As known, AMD-OpenCL supports WaveFront (August 2015): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

The AMD Radeon HD 7770 GPU, for example, supports more than 25,000 in-flight work-items and can switch to a new wavefront (containing up to 64 work-items) in a single cycle.


But why in the OpenCL standards 1.0/2.0/2.2 there is no mention about the WaveFront?

None of the PDF has not a word WaveFront: https://www.khronos.org/registry/OpenCL/specs/

Also I found that:

OpenCL is a open standard. It still does not support this swizzling concept. It does not even support wavefront/warp yet.

That's why the concept is not on the OpenCL specification itself.

Standard OpenCL doesn't have the notion of a "wavefront"

enter image description here

Indeed the official OpenCL 2.2 standard still does not support the WaveFront?


CONCLUSION:

There is no WaveFront in OpenCL standard, but in OpenCL-2.0 there is Sub-groups with SIMD execution model akin to WaveFronts.

6.4.2 Workgroup/subgroup-level functions

OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a logical abstraction of the hardware SIMD execution model akin to wavefronts, warps, or vectors and permit programming closer to the hardware in a vendor-independent manner. This extension includes a set of cross-sub-group built-in functions that match the set of the cross-work-group built-in functions specified above.

Community
  • 1
  • 1
Alex
  • 12,578
  • 15
  • 99
  • 195

1 Answers1

2

They must have gone to a more dynamical approach called sub-group: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf

Sub-group: Sub-groups are an implementation-dependent grouping of work-items within a
work-group. The size and number of sub-groups is implementation-defined.

and

Work-groups are further divided into sub-groups,
which provide an additional level of control over execution.

and

The mapping of work-items to
sub-groups is implementation-defined and may be queried at runtime. 

so even if its not called wavefront, its now queryable in run-time and

In the absence of synchronization functions (e.g. a barrier), work-items within a sub-group may be serialized. In the presence of sub -group functions, work-items within a sub -group may be serialized before any given sub -group function, between dynamically encountered pairs of sub - group functions and between a work-group function and the end of the kernel.

even lockstep manner may be lost at times.

On top of these,

 sub_group_all() and
sub_group_broadcast() and are described in OpenCL C++ kernel language and IL specifications.
The use of these sub-group functions implies sequenced-before relationships between statements
within the execution of a single work-item in order to satisfy data dependencies.

saying that some kind of intra-sub-group communication exists. Because now opencl has child-kernel definition:

Device-side enqueue: A mechanism whereby a kernel-instance is enqueued by a kernel-instance
running on a device without direct involvement by the host program. This produces nested
parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance.
The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the
child kernel) to a device-side command queue. Child and parent kernels execute asynchronously
though a parent kernel does not complete until all of its child-kernels have completed. 

Ultimately, with something like

kernel void launcher()
{
    ndrange_t ndrange = ndrange_1D(1);
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
    ^{
    size_t id = get_global_id(0);
    }
    );
}

you should be able to spawn your own (upgraded?)wavefronts with any size you need and they work concurrently with parent kernel(and can communicate intra-sub-group threads) but they are not called wavefronts because they are not hardcoded by hardware imho.


2.0 api specs saying:

Extreme care should be exercised when writing code that uses
subgroups if the goal is to write portable OpenCL applications.

which reminds amd's 16-wide simds and nvidia's 32-wide simds versus some imaginary fpga's 95-wide compute cores. Pseudo-wavefront maybe?

huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
  • Thank you! But what does it mean, and what kind of problem could be: `Extreme care should be exercised when writing code that uses subgroups if the goal is to write portable OpenCL applications`? As there said: `**Sub-groups are a logical abstraction of the hardware SIMD execution model** akin to wavefronts`: page-100: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf – Alex Feb 19 '17 at 14:11
  • 1
    If Sub-groups are a logical abstraction of the hardware SIMD, and if at run-time I can get width of Sub-group (SIMD) for the current device by using `get_sub_group_size()`/`get_max_sub_group_size()`, where can occur a porblem? page-160: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf – Alex Feb 19 '17 at 14:12
  • 2
    get_max_sub_group_size: it says `This value will be invariant for a given set of dispatch dimensions and a kernel object compiled for a given device` so it would work – huseyin tugrul buyukisik Feb 19 '17 at 14:30
  • @Alex I'm not 100% sure, but I think that what the warning may suppose to mean, is that all threads from a given sub-group must perform **exactly** the same instructions: go to the same if-branches, perform the same number of steps in loops, etc to obey SIMD contract (unless you use barrier of course, but I guess the whole point is to avoid it when possible). Some hardware may be more forgiving than another in this regard and that may be the reason for potential portability problems. – morgwai Oct 08 '21 at 10:35
  • @Alex my previous thinking seems to be incorrect, as GPUs actually use [SIMT](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads), which allows for divergences in execution paths among threads (divergences make it much slower as under the hood all threads execute all paths in such case). – morgwai Oct 10 '21 at 08:48