I have two files, each containing 10,000 points; each point consists of two doubles, X and Y. We need to perform an operation on every pair of points (one point from each file), which gives 100,000,000 operations (10,000 × 10,000).
First question: What data structure do you recommend? That is, which variables should I pass to the kernel?
I have already written this program and run it on 1,000-point files (1,000,000 operations). I put all the point pairs into one array (1,000,000 × 4, where the 4 columns are X and Y from the first file and X and Y from the second file) and passed it to the kernel, so I had 1,000,000 parallel threads (work-items).
local_item_size = 125
global_item_size = 1000000
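To make the layout concrete, here is a minimal sketch in plain C of how the expanded array described above is built on the host. The helper name `build_pairs`, the row order (row `i*n + j` holds point `i` of the first file and point `j` of the second), and the interleaved X,Y input layout are assumptions made for illustration, not necessarily what my program does line for line.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: expands two point lists of length n into one
 * (n*n) x 4 row-major array. Inputs a and b hold interleaved X,Y values
 * (a[2*i] = X, a[2*i+1] = Y). Row (i*n + j) of the result is
 * { aX[i], aY[i], bX[j], bY[j] }. The row order is an assumption.
 * Returns NULL if the allocation fails; the caller frees the result. */
double *build_pairs(const double *a, const double *b, size_t n)
{
    double *pairs = malloc(n * n * 4 * sizeof *pairs);
    if (pairs == NULL)
        return NULL;

    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double *row = pairs + (i * n + j) * 4;
            row[0] = a[2 * i];      /* X from the first file  */
            row[1] = a[2 * i + 1];  /* Y from the first file  */
            row[2] = b[2 * j];      /* X from the second file */
            row[3] = b[2 * j + 1];  /* Y from the second file */
        }
    }
    return pairs;
}
```

With this layout every input point is copied n times, so the array grows with n² even though the input only grows with n.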
Second question: Do you think this structure can be improved, and if so, how?
Third question: The program works correctly for the 1,000-point files, but when I run it on the 10,000-point files, clCreateBuffer fails with CL_INVALID_BUFFER_SIZE for the input array of 100,000,000 × 4 doubles. I think (but I am not sure) the reason is the huge number of threads generated (100,000,000).
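For reference, a back-of-the-envelope size check, assuming 8-byte doubles. The function names are hypothetical. As far as I can tell from the OpenCL specification, clCreateBuffer returns CL_INVALID_BUFFER_SIZE when the requested size is 0 or larger than the device's CL_DEVICE_MAX_MEM_ALLOC_SIZE; I have not queried that limit on my GPU, so the buffer size is only a possible alternative explanation, not a confirmed one.

```c
#include <stdint.h>

/* Bytes needed for the expanded (n*n) x 4 array of doubles
 * (assumes sizeof(double) == 8, which holds on common platforms). */
uint64_t expanded_bytes(uint64_t n)
{
    return n * n * 4u * (uint64_t)sizeof(double);
}

/* Bytes needed if the two n x 2 point arrays were passed separately
 * instead of being expanded into all pairs. */
uint64_t compact_bytes(uint64_t n)
{
    return 2u * n * 2u * (uint64_t)sizeof(double);
}
```

For n = 1,000 the expanded array is 32,000,000 bytes (about 32 MB); for n = 10,000 it is 3,200,000,000 bytes (about 3.2 GB), while the two original point arrays together are only 320,000 bytes.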
UPDATE:
- The hardware is an Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz and an NVIDIA Corporation GM204 [GeForce GTX 980].
- For each point we run a for loop of 1000 operations (with 3 ifs). These operations are done in the kernel, and the result for each point is completely independent of all other points.
UPDATE2 (simplifying the problem): We need to multiply two matrices, A and B. A has 10,000 rows and 2 columns, and B has 2 rows and 10,000 columns. What is the best structure for doing this?
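To pin down exactly what is being asked, here is a plain C reference version of that product (not OpenCL code). The function name `matmul_n2_2n` and the row-major storage are assumptions for illustration. Each (i, j) iteration is the independent piece of work that one work-item would perform, and it reads only row i of A and column j of B.

```c
#include <stddef.h>

/* Reference computation of C = A * B, where
 *   A is n x 2, B is 2 x n, C is n x n, all stored row-major.
 * Every C[i][j] depends only on row i of A and column j of B,
 * so the n*n entries can be computed independently of each other. */
void matmul_n2_2n(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            C[i * n + j] = A[i * 2 + 0] * B[0 * n + j]
                         + A[i * 2 + 1] * B[1 * n + j];
        }
    }
}
```

Note that the inputs here stay at their original sizes (n × 2 and 2 × n); only the output is n × n.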
Thanks in advance,