I am working on an image processing application that handles gray-scale images only. GPU occupancy is limited by the high number of vector registers and the amount of local memory used per workgroup.
The read_imagef() function returns a float4, but my application uses only the first three components, so every computation carries an extra float operation, which increases execution time.
In addition, the kernel performs many multiply-add (MAD) operations on float4 values.
How can I optimize this kernel so that it uses fewer vector registers? Are there any tips or tricks to speed up the MAD operations? (I have already tried the hardware-supported function, and performance went down.)
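
To make the pattern concrete, here is a heavily simplified sketch of the kind of access described above. It is not the real kernel: the kernel name, the arguments, the sampler settings, and the coefficients are all placeholders chosen for illustration, and the real kernel does many more multiply-adds per pixel.

```c
// Illustrative sketch only; all names and parameters here are hypothetical.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void mad_sketch(__read_only image2d_t src,
                         __write_only image2d_t dst,
                         const float4 scale,
                         const float4 offset)
{
    const int2 pos = (int2)(get_global_id(0), get_global_id(1));

    // Skip work-items that fall outside the image.
    if (pos.x >= get_image_width(src) || pos.y >= get_image_height(src))
        return;

    // read_imagef always returns a float4, although only .xyz is used.
    float4 px = read_imagef(src, smp, pos);

    // Multiply-add on the full float4: the .w lane is computed and
    // held in a register even though its result is never needed.
    float4 r = px * scale + offset;

    write_imagef(dst, pos, r);
}
```

The fourth lane of `px` and `r` is the extra per-operation cost and register use that the question is about.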