OpenCL - need a recommended structure

Question

I have two files, each containing 10,000 points; each point consists of two doubles, X and Y. We need to perform an operation on every pair of points, so there are 100,000,000 operations (10,000 x 10,000).

First question: What structure do you recommend? That is, which variables should I pass to the kernel?

I have already written this program and run it on 1,000-point files (1,000,000 operations). I put all the point pairs in one array (1,000,000 x 4; the 4 comes from X, Y of the first file and X, Y of the second file) and passed it to the kernel, so I had 1,000,000 parallel threads.

local_item_size = 125
global_item_size = 1000000

Second question: Do you think I can improve this structure, and how?

Third question: The program works correctly for 1,000-point files, but when I run it on 10,000-point files, clCreateBuffer fails with CL_INVALID_BUFFER_SIZE (for the 100,000,000 x 4 double input array). I think (but I am not sure) the reason is the huge number of generated threads (100,000,000).

UPDATE: The hardware is an Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz and an NVIDIA Corporation GM204 [GeForce GTX 980]. There is a for loop with 1,000 operations (3 ifs) for each point; these operations are done in the kernel, and the result for each point is completely independent of all other points.

UPDATE 2 (simplifying the problem): We need to multiply two matrices A and B. A has 10,000 rows and 2 columns, and B has 2 rows and 10,000 columns. What is the best structure for this?

Thanks in advance,

What is the HW you are using? How big is your computation for each point? Is it just a couple of operations or something bigger? Is the operation on each point completely independent from all other points? (trivially parallel tasks)? — Kris, May 12 '15 at 09:10
This might indeed be a problem of size. In this question, http://stackoverflow.com/questions/8520421/what-is-the-size-limit-for-a-class, you can see that you exceed the maximum size. What might help is creating a multidimensional pointer array, like here: http://www.cplusplus.com/forum/articles/7459/ — laurisvr, May 12 '15 at 09:16
Furthermore, I agree with @Krystian, please provide us with more details as to how you have laid out your list now. And which operations you perform. — laurisvr, May 12 '15 at 09:17
@Krystian @laurisvr, I updated the question and added the required details — Rami Aqqad, May 12 '15 at 09:44
One simple suggestion: use vector data types. You can have float2 instead of two floats for x- and y-coordinates. — Christian, May 12 '15 at 11:38
@Christian, I need to do the operations between all points from the first file A and all points from the second file B. So I put them in one array (XA, YA, XB, YB) and used a float4 array. The size of this array is number of points * number of points. This structure worked when the number of points was 1,000 (input array size: 1,000,000), but it does not work for 10,000 (input array size: 100,000,000). I think I have to change this structure, but how? — Rami Aqqad, May 12 '15 at 12:05
What you construct is essentially the cross product of the point lists. This leads to n^2 memory usage. May I ask what speaks against two separate arrays, one for the points from file A and one for the points from file B? I guess your operation will still work with those two arrays. — Christian, May 12 '15 at 14:37
@Christian, yes, it looks like the cross product of the point lists, but the question is how I can pass all the points of file B to each point of file A in the kernel. Is there a sample program that does that? What is the best structure for this if I have 100,000 points in file A and the same number in file B? — Rami Aqqad, May 12 '15 at 14:47

score 1 · Answer 1 · answered May 12 '15 at 19:38

I suggest trying a blocking approach to break the problem into manageable chunks.

Create a kernel which can process a 32x2 block of A and a 2x32 block of B. The result is 32x32x8 bytes (8 KB) and should fit into the local memory of both devices. Call this kernel as many times as needed.

Process the remainder of the elements (i.e., the partial blocks which weren't processed by the kernel) on the host while the main compute device is doing its work.
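
The blocking idea can be sketched in plain C. This is only an illustration under stated assumptions: `process_block` and `multiply_blocked` are hypothetical names, A is taken as n x 2 and B as 2 x n (both row-major), and the body of `process_block` stands in for one kernel launch. In a real OpenCL kernel the two loops inside it would be replaced by `get_global_id(0)` and `get_global_id(1)`.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* block edge from the answer: 32x2 of A times 2x32 of B */

/* Hypothetical stand-in for one kernel launch: computes one TILE x TILE
   block of C = A * B, where A is n x 2 and B is 2 x n, both row-major. */
static void process_block(const double *a, const double *b, double *c,
                          size_t n, size_t row0, size_t col0)
{
    for (size_t i = row0; i < row0 + TILE; ++i)
        for (size_t j = col0; j < col0 + TILE; ++j)
            c[i * n + j] = a[i * 2] * b[j] + a[i * 2 + 1] * b[n + j];
}

/* Runs process_block over every full tile, then finishes the leftover
   rows and columns element by element (the remainder work). */
static void multiply_blocked(const double *a, const double *b, double *c,
                             size_t n)
{
    assert(a && b && c);
    size_t full = n - n % TILE;  /* largest multiple of TILE that is <= n */

    for (size_t r = 0; r < full; r += TILE)
        for (size_t col = 0; col < full; col += TILE)
            process_block(a, b, c, n, r, col);

    /* Remainder: for rows inside the tiled region only the trailing
       columns are missing; rows past it need every column. */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = (i < full ? full : 0); j < n; ++j)
            c[i * n + j] = a[i * 2] * b[j] + a[i * 2 + 1] * b[n + j];
}
```

Calling `multiply_blocked(a, b, c, n)` with n not divisible by 32 exercises both paths. Here the remainder runs after the tiles; on a real device it could overlap with the kernel launches, as the answer suggests.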

A search for "matrix multiplication blocking algorithm" gets some decent hits. Here is a good one.

score 1 · Accepted Answer · answered May 13 '15 at 04:08

Regarding Update 2: the best way of handling matrices is to store them in row-major order. You need two matrices with 20,000 elements each. Matrix A (10,000 rows by 2 columns) is stored as 10,000 rows of only 2 elements each. Matrix B (2 rows by 10,000 columns) is stored as 2 rows of 10,000 elements each.

Take a look at my blog, linked from my profile. It has a (German) tutorial for OpenCL-based matrix multiplication.
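
As a rough sketch of what row-major storage means for indexing, using the shapes from the question (A is n x 2, B is 2 x n): the helper names below are hypothetical, and `i` and `j` are plain parameters here, whereas in an OpenCL kernel they would come from `get_global_id(0)` and `get_global_id(1)`. The two size helpers compare the memory for two separate input arrays against the combined (XA, YA, XB, YB) array with n*n entries; for n = 10,000 that is 320,000 bytes versus 3,200,000,000 bytes. Since clCreateBuffer returns CL_INVALID_BUFFER_SIZE when the requested size exceeds the device's CL_DEVICE_MAX_MEM_ALLOC_SIZE, the combined array is a plausible cause of the error in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (row, col) in a matrix with `cols` columns. */
static size_t rm_index(size_t row, size_t col, size_t cols)
{
    assert(col < cols);
    return row * cols + col;
}

/* One output element of C = A * B, with A as n x 2 and B as 2 x n.
   Each work-item would compute exactly one such element. */
static double product_at(const double *a, const double *b,
                         size_t n, size_t i, size_t j)
{
    return a[rm_index(i, 0, 2)] * b[rm_index(0, j, n)]
         + a[rm_index(i, 1, 2)] * b[rm_index(1, j, n)];
}

/* Bytes for the two separate inputs: 2 matrices of 2n doubles each. */
static unsigned long long separate_inputs_bytes(unsigned long long n)
{
    return 2ULL * (2ULL * n) * sizeof(double);
}

/* Bytes for the combined (XA, YA, XB, YB) array with n*n entries. */
static unsigned long long combined_input_bytes(unsigned long long n)
{
    return n * n * 4ULL * sizeof(double);
}
```

With two small buffers as inputs, every work-item reads its own row of A and column of B, so nothing of size n^2 has to be uploaded to the device.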
