
I have started learning OpenCL programming. As a starting point, I am looking at writing optimized code for the following 3rd-degree polynomial:

g(x) = b1(x).f(x) + b2(x).(f(x))^2 + b3(x).(f(x))^3

The above equation can be reduced to the following:

g(x) = f(x)[b1(x)+f(x)[b2(x)+f(x).b3(x)]]

which greatly reduces the number of multiplications.

Suppose my f, b1, b2 and b3 are matrices of size 500x500. These are the options I thought of for implementing this algorithm:

  1. Implement a kernel with 500x500 threads, each operating on one element of the matrix.
  2. Implement a kernel with 500 threads, each operating on 500 elements, i.e., each thread operating on one row.

Moreover, the arrays b1, b2 and b3 are constant. I read that constant arrays can be moved to the device once and kept there in device memory. Please share any other optimizations that are possible.
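For concreteness, this is roughly the kernel I have in mind for option 1, with b1, b2 and b3 passed as constant buffers (the kernel and argument names are just placeholders, and the __constant qualifier assumes the matrices fit within the device's constant buffer size; otherwise __global const would be used):

    // One work-item per matrix element; evaluates g via the Horner form above.
    __kernel void eval_poly(__global const float *f,
                            __constant float *b1,
                            __constant float *b2,
                            __constant float *b3,
                            __global float *g,
                            const int n)      // total element count (500*500)
    {
        int i = get_global_id(0);
        if (i < n) {
            float fx = f[i];
            g[i] = fx * (b1[i] + fx * (b2[i] + fx * b3[i]));
        }
    }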

Thanks in advance

sravan


2 Answers


Here are some suggestions:

  1. Align the size of the matrix to the preferred work-group size for the device (usually 64 items work best for most devices, but you can query the value using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE in clGetKernelWorkGroupInfo; the host-side query is sketched after this list).
  2. Calculate more than one element per work-item, to reduce the overhead - although 500 elements per work-item is not ideal, because 1) it is unnecessarily coarse-grained and 2) it limits you to only 500 work-items.
  3. Inside the kernel, unroll the loop that iterates over the elements so that it does at least 4 elements per iteration (a kernel sketch follows this list).
  4. You should definitely keep the b1, b2 and b3 arrays on the device. As a rule, keep as much as possible on the device.
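For suggestions 2 and 3, a minimal kernel sketch might look like the following (ELEMS_PER_ITEM, the kernel name and the argument names are illustrative; the element count is assumed to be a multiple of ELEMS_PER_ITEM, and ELEMS_PER_ITEM a multiple of 4, so no tail handling is shown):

    // Each work-item computes ELEMS_PER_ITEM consecutive elements of g,
    // with the inner loop manually unrolled by 4.
    #define ELEMS_PER_ITEM 8

    __kernel void eval_poly_unrolled(__global const float *f,
                                     __global const float *b1,
                                     __global const float *b2,
                                     __global const float *b3,
                                     __global float *g,
                                     const int n)
    {
        int base = get_global_id(0) * ELEMS_PER_ITEM;
        if (base >= n)
            return;

        for (int j = 0; j < ELEMS_PER_ITEM; j += 4) {
            int i = base + j;
            float f0 = f[i], f1 = f[i + 1], f2 = f[i + 2], f3 = f[i + 3];
            g[i]     = f0 * (b1[i]     + f0 * (b2[i]     + f0 * b3[i]));
            g[i + 1] = f1 * (b1[i + 1] + f1 * (b2[i + 1] + f1 * b3[i + 1]));
            g[i + 2] = f2 * (b1[i + 2] + f2 * (b2[i + 2] + f2 * b3[i + 2]));
            g[i + 3] = f3 * (b1[i + 3] + f3 * (b2[i + 3] + f3 * b3[i + 3]));
        }
    }

For suggestion 1, the preferred multiple can be queried on the host roughly like this (the helper name is illustrative; the kernel and device are assumed to have been created already):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Returns the preferred work-group size multiple for a built kernel on a
       given device, so global/local sizes can be rounded to a multiple of it. */
    static size_t preferred_wg_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t multiple = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);
        printf("Preferred work-group size multiple: %zu\n", multiple);
        return multiple;
    }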

For a thorough list of all optimizations possible you should read the optimization guide from the OpenCL device vendor for the device that you are using.

Lubo Antonov

You definitely want to be compute-bound with your algorithm, rather than memory-bound. There is a lot of computation to do here, so that should not be a problem. I have responded to a couple of other questions regarding efficient memory access patterns. #1, #2.

I have found that your suggestion #2 works best -- i.e., have a workgroup calculate the results of an entire row at a time. I extend this idea by using fewer groups and having them move on to other rows as their work is finished. Using a barrier for the group at the end of a row's work won't kill performance noticeably, since you have enough work to be done. If your memory is ordered such that row[i*w+500] === row[(i+1)*w], you can get your work groups to crunch two rows at a time and avoid a little bit of down time.

I think rows of 500 are enough work to keep the compute unit saturated without having to do two rows at a time. On my platform, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is 64; if 64, 128, or 256 is used as the workgroup size, only 12 work items are idle for the last iteration on the row. The goal is to use the smallest workgroup possible while NOT being memory bound (and a multiple of the preferred workgroup size).

500 workgroups is probably too many. Some multiple (I like to use 1) of your compute units on the device will work well (see: CL_DEVICE_MAX_COMPUTE_UNITS). If you have too many groups, they will either wait to be scheduled, or will suffer from being swapped in/out of the cores/registers of your device.
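A rough sketch of that scheme, with a fixed number of work-groups that each stride across a row in chunks of the local size and then move on to their next row (the kernel name, argument names and row-assignment scheme are illustrative):

    // Launch e.g. one work-group per compute unit; each group processes rows
    // group_id, group_id + num_groups, group_id + 2*num_groups, ...
    __kernel void eval_poly_rows(__global const float *f,
                                 __global const float *b1,
                                 __global const float *b2,
                                 __global const float *b3,
                                 __global float *g,
                                 const int width,
                                 const int height)
    {
        int lid     = get_local_id(0);
        int lsize   = get_local_size(0);
        int group   = get_group_id(0);
        int ngroups = get_num_groups(0);

        for (int row = group; row < height; row += ngroups) {
            // Each work-item handles columns lid, lid + lsize, lid + 2*lsize, ...
            for (int col = lid; col < width; col += lsize) {
                int i = row * width + col;
                float fx = f[i];
                g[i] = fx * (b1[i] + fx * (b2[i] + fx * b3[i]));
            }
        }
    }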

You definitely want to store the constant data on the device if you have enough memory to do so. Even writing the output to local memory and copying it to global at the end of a row computation may boost the performance.
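One way to express that staging idea is a variant of the row kernel above that writes each chunk into __local memory first and copies it out afterwards (a sketch only; LOCAL_MAX and the names are illustrative, and the work-group size is assumed to be at most LOCAL_MAX):

    #define LOCAL_MAX 256

    __kernel void eval_poly_rows_staged(__global const float *f,
                                        __global const float *b1,
                                        __global const float *b2,
                                        __global const float *b3,
                                        __global float *g,
                                        const int width,
                                        const int height)
    {
        __local float tile[LOCAL_MAX];   // one result per work-item per chunk

        int lid     = get_local_id(0);
        int lsize   = get_local_size(0);
        int group   = get_group_id(0);
        int ngroups = get_num_groups(0);

        for (int row = group; row < height; row += ngroups) {
            for (int base = 0; base < width; base += lsize) {
                int i = row * width + base + lid;
                if (base + lid < width) {
                    float fx = f[i];
                    tile[lid] = fx * (b1[i] + fx * (b2[i] + fx * b3[i]));
                }
                // Wait until the whole chunk has been computed, then copy it out.
                barrier(CLK_LOCAL_MEM_FENCE);
                if (base + lid < width)
                    g[i] = tile[lid];
            }
        }
    }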

mfa
  • In my case the condition row[i*w+500] === row[(i+1)*w] holds. Though I still need to write and time the code. Thanks for this suggestion. Currently I am trying to optimize this code on an Intel Xeon with 12 cores. To analyze the performance I took data of size 256x828. I have set my globalRange as 256 and localRange as 4, giving 64 work groups. The performance has improved a lot, though I am yet to try other scenarios. I read that vectorizing the floats can improve performance a lot. Can you point me to some good links about that? Thanks – Sravan Apr 04 '12 at 13:26