How to make the most of SIMD in OpenCL?

Question

In the optimization guide of Beignet, an open source implementation of OpenCL targeting Intel GPUs

Work group Size should be larger than 16 and be multiple of 16.

As two possible SIMD lanes on Gen are 8 or 16. To not waste SIMD lanes, we need to follow this rule.

Also mentioned in the Compute Architecture of Intel Processor Graphics Gen7.5:

For Gen7.5 based products, each EU has seven threads for a total of 28 Kbytes of general purpose register file (GRF).

...

On Gen7.5 compute architecture, most SPMD programming models employ this style code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane.

In actuality, each thread executes a SIMD-Width number of kernel instances >concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for SIMD-32 x 7 threads = 224 kernel instances executing concurrently on a single EU.

If I understand it correctly, using the SIMD-16 x 7 threads = 112 kernel instances as a example, in order to run 224 threads on one EU, the work group size need to be 16. Then the OpenCL compiler will fold 16 kernel instances into a 16 lane SIMD thread, and do this 7 times on 7 work groups, and run them on a single EU?

Question 1: am I correct until here?

However OpenCL spec also provide vector data types. So it's feasible to make full use of the SIMD-16 computing resources in a EU by conventional SIMD programming(as in NEON and SSE).

Question 2: If this is the case, using vector-16 data type already makes explicit use of the SIMD-16 resources, hence removes the at-least-16-item-per-work-group restrictions. Is this the case?

Question 3: If all above are true, then how does the two approach compare with each other: 1) 112 threads fold into 7 SIMD-16 threads by OpenCL compiler; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?

Tim · Answer 1 · 2016-06-02T21:41:21.433

Almost. You are making the assumptions that there is one thread per workgroup (N.B. thread in this context is what CUDA calls a "wave". In Intel GPU speak a work item is a SIMD channel of a GPU thread). Without subgroups, there is no way to force a workgroup size to be exactly a thread. For instance, if you choose a WG size of 16, the compiler is still free to compile SIMD8 and spread it amongst two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before the WG size is known to it (clCompileProgram precedes clEnqueueNDRange). The subgroups extension might allow you to force the SIMD width, but is definitely not implemented on GEN7.5.
OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. Were you to use float16 for example. Each of the work items would be processing 16 floats each, but the compiler would still compile at least SIMD8. Hence each GPU thread would be processing (8 * 16) floats (in parallel though). That might be a bit overkill. Ideally we don't want to have to explicitly vectorize our CL by using explicit OpenCL vector types. But it can be helpful sometimes if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it says float4 is a good rule of thumb.
I think you meant 112 work items? By native thread do you mean CPU threads or GPU threads?
- If you meant CPU threads, the usual arguments about GPUs apply. GPUs are good when your program doesn't diverge much (all instances take similar paths) and you use the data enough times to mitigate the cost transferring it to and from the GPU (arithmetic density).
- If you meant GPU threads (the GEN SIMD8 or SIMD16 critters). There is no (publicly visible) way to program the GPU threads explicitly at the moment (EDIT see the subgroups extension (not available on GEN7.5)). If you were able to, it'd be a similar trade off to assembly language. The job is harder, and the compiler sometimes just does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can generally do better with enough programming effort (until hardware changes and your clever program's assumptions becomes invalidated.)

How to make the most of SIMD in OpenCL?

1 Answers1