1

I am working on an image processing application with gray-scale images only. The GPU occupancy is limited by the increased number of vector registers per work-group and by the local memory per work-group.

The read_imagef() function returns a float4, but my application works with only the first three components of that float4, so there is an extra float operation in every computation (which increases execution time).

Nevertheless, the kernel also performs many multiply-add (MAD) ops on float4.

How can I optimize this kernel so that it uses fewer vector registers, and are there any tips or tricks to speed up the MAD ops? (I have already tried the hardware-supported function and the performance went down.)
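For illustration only (this is not the actual kernel; the filter, its name, and the coefficients are made up), the situation looks roughly like this:

    // Hypothetical sketch, NOT the real kernel: a small horizontal filter.
    // read_imagef() always yields a float4, so every mad below also computes
    // a fourth lane whose result is never needed for a gray-scale image.
    __kernel void filter3(read_only image2d_t src,
                          write_only image2d_t dst,
                          __constant float4 *coeff)   // 3 made-up filter taps
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));

        float4 acc = (float4)(0.0f);
        for (int k = -1; k <= 1; ++k) {
            float4 p = read_imagef(src, smp, pos + (int2)(k, 0));
            acc = mad(p, coeff[k + 1], acc);   // the .w lane is wasted work
        }
        write_imagef(dst, pos, acc);
    }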

mmain
  • 333
  • 3
  • 19

3 Answers

2

If you work with gray-scale images only, you could implement your own 'read_imagef()' which reads only one channel of the image, so that everything you deal with is a plain float.


As your data may be interleaved in memory as RGBRGB..., loading only the R channel probably costs the same time as loading all channels. This is the array-of-structures situation. You can find more details here:

Structure of Arrays vs Array of Structures in cuda

Given that data layout, you could load the float4/float3, extract one float channel from it, and then do the computation on the extracted float.
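A minimal sketch of that idea (the kernel name gray_process and the scalar arguments a and b are placeholders; the actual per-pixel operation is up to you):

    // Sketch: read the full pixel (cheap, since the data is interleaved
    // anyway), keep only one channel, and do the arithmetic in scalar float.
    __kernel void gray_process(read_only image2d_t src,
                               write_only image2d_t dst,
                               float a, float b)       // placeholder op: a*g + b
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));

        float4 pix = read_imagef(src, smp, pos);
        float  g   = pix.x;               // the single gray channel
        float  r   = mad(g, a, b);        // scalar multiply-add only

        // Broadcast the scalar result back if a 4-channel image is written.
        write_imagef(dst, pos, (float4)(r, r, r, 1.0f));
    }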

the kernel performs many Multiply Add ops also on float4

I don't see why your kernel has to do those ops on float4. Maybe you could show some code to demonstrate that.

Community
  • 1
  • 1
kangshiyin
  • 9,681
  • 1
  • 17
  • 29
  • this may be a solution, but the reading access pattern may stall the computations; a float4 read is more efficient (on my machine), so the performance is restricted by **CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT** – mmain Jul 03 '16 at 23:07
  • @OmarGW Do you mean you still need to access all three channels? – kangshiyin Jul 04 '16 at 03:41
  • I mean accessing the three channels gives better performance. What is your idea about implementing my own **read_imagef**? – mmain Jul 04 '16 at 13:35
  • @OmarGW The point is you do the computing part only on one channel of the data. – kangshiyin Jul 04 '16 at 13:40
  • Indeed I am - one channel only – mmain Jul 04 '16 at 13:42
  • @OmarGW see my update. You could load all channels but use only one channel during computation, couldn't you? With the interleaved data layout, that is effectively the same as loading one channel. – kangshiyin Jul 04 '16 at 13:59
  • many thanks, you've given me an idea here - this explains a lot - the MAD part could now be done on 'float'. My main concern is trying to squeeze as much performance out of the MAD ops as possible, but that's OK for now – mmain Jul 04 '16 at 14:15
1

If it returns float4 and does so within the same number of internal memory operations as float3, then the latency would be the same. A mad operation has a much shorter latency than a memory operation.

There is no float3 hardware as far as I know, so you can compute the 3 elements one by one if it is a scalar micro-architecture (such as a new GPU). If it is VLIW-4, then whether or not it uses the 4th element at the same time, it will run at the same speed.
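As a rough sketch of what that means in kernel code (the kernel name, arguments, and operation are placeholders):

    // Sketch: on a scalar GPU the three useful lanes can be issued as three
    // independent scalar mads, so nothing is spent on the unused .w lane.
    __kernel void mad_compare(read_only image2d_t src,
                              write_only image2d_t dst,
                              float a, float b)          // placeholder op
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        float4 p = read_imagef(src, smp, pos);

        float x = mad(p.x, a, b);
        float y = mad(p.y, a, b);
        float z = mad(p.z, a, b);
        // Equivalent vector form; on VLIW-4 hardware it costs the same:
        //   float4 r = mad(p, (float4)(a), (float4)(b));

        write_imagef(dst, pos, (float4)(x, y, z, 1.0f));
    }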

huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
  • that seems reasonable, it is indeed the same latency; however, float4 uses 4 float instructions whereas float3 uses 3 float instructions, and this increases execution time (I tested this myself), so my idea was to save some flops by using float3 instead of float4 – mmain Jul 04 '16 at 13:41
0

It's hard to be specific here, since it depends on what hardware it is and what you are doing to the image content. You may be better off dealing with the image as a plain buffer of bytes (with your own conversion to float), but that risks adding more CL code, and it reduces the use of the texture unit in the GPU that does the conversion for you (assuming there is hardware for this).
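A minimal sketch of the buffer-of-bytes variant, assuming an 8-bit gray-scale buffer (the kernel name gray_buffer, the arguments, and the operation applied are placeholders):

    // Sketch: treat the gray-scale image as a plain uchar buffer and do the
    // uchar -> float conversion manually instead of via the texture unit.
    __kernel void gray_buffer(__global const uchar *src,
                              __global float *dst,
                              int width, float a, float b)   // placeholder op
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int idx = y * width + x;

        float g = convert_float(src[idx]) / 255.0f;   // manual normalization
        dst[idx] = mad(g, a, b);                      // scalar multiply-add
    }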

Another option is to do four read_imagef calls, "merge" the values into one float4, do your math, and split the result.
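A hedged sketch of that option (which pixels get packed together, and the operation done to them, are purely illustrative):

    // Sketch: each work-item reads four neighbouring gray pixels, packs their
    // first channels into one float4, does the math as a single vector mad,
    // then splits the result back into four writes.
    __kernel void packed_mad(read_only image2d_t src,
                             write_only image2d_t dst,
                             float a, float b)              // placeholder op
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int x = get_global_id(0) * 4;   // four pixels per work-item
        int y = get_global_id(1);

        float4 p0 = read_imagef(src, smp, (int2)(x,     y));
        float4 p1 = read_imagef(src, smp, (int2)(x + 1, y));
        float4 p2 = read_imagef(src, smp, (int2)(x + 2, y));
        float4 p3 = read_imagef(src, smp, (int2)(x + 3, y));
        float4 v  = (float4)(p0.x, p1.x, p2.x, p3.x);   // merge one channel each

        float4 r = mad(v, (float4)(a), (float4)(b));    // one vector MAD

        write_imagef(dst, (int2)(x,     y), (float4)(r.x));
        write_imagef(dst, (int2)(x + 1, y), (float4)(r.y));
        write_imagef(dst, (int2)(x + 2, y), (float4)(r.z));
        write_imagef(dst, (int2)(x + 3, y), (float4)(r.w));
    }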

Unfortunately, although OpenCL is "portable", it is not "portable with known performance", so what works well in one OpenCL implementation may not work in another, and tweaking/tuning the algorithm for performance requires a good understanding of the architecture as a whole.

Mats Petersson
  • 126,704
  • 14
  • 140
  • 227
  • I manually worked with each of the 3 components (for intermediate results) and combined them back into a float4, and it did enhance the performance and decrease the vector-register count, though not significantly. Maybe for a bigger problem size this would be beneficial. --- As for the performance-portability issue, this is the OpenCL nightmare, unlike CUDA. – mmain Jul 03 '16 at 21:36