I have started learning OpenCL programming. As a first exercise, I am looking at writing optimized code for the following third-degree polynomial:
g(x) = b1(x).f(x) + b2(x).(f(x))^2 + b3(x).(f(x))^3
The above equation can be rewritten in the following nested form:
g(x) = f(x)[b1(x)+f(x)[b2(x)+f(x).b3(x)]]
which reduces the work to three multiplications and two additions per element.
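As a quick check of that claim, here is a minimal sketch in plain C comparing the direct and nested evaluations for a single element. The function names `poly_direct` and `poly_nested` are made up for illustration, and `float` is an assumption about the element type.

```c
/* Direct evaluation: b1*f + b2*f^2 + b3*f^3.
   Even when f^2 is reused to build f^3, this costs 5 multiplications. */
float poly_direct(float f, float b1, float b2, float b3)
{
    float f2 = f * f;    /* f^2 */
    float f3 = f2 * f;   /* f^3 */
    return b1 * f + b2 * f2 + b3 * f3;
}

/* Nested evaluation: f*(b1 + f*(b2 + f*b3)).
   Same polynomial, 3 multiplications and 2 additions. */
float poly_nested(float f, float b1, float b2, float b3)
{
    return f * (b1 + f * (b2 + f * b3));
}
```

Both functions compute the same polynomial; in floating point the results can differ in the last bits for general inputs because the operations are ordered differently.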
Suppose f, b1, b2 and b3 are matrices of size 500x500. These are the options I have thought of for implementing this algorithm:
- Implement a kernel with 500x500 threads, each operating on one element of the matrix.
- Implement a kernel with 500 threads, each operating on 500 elements, i.e., each thread operating on one row.
Moreover, the arrays b1, b2 and b3 are constant. I have read that constant arrays can be moved to the device once and kept there in device memory. Please let me know if there are any other optimizations possible.
Thanks in advance
sravan