
I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0};

kernel void test()
{
    int arr[1000] = {0};
}

Then I create N threads (work-items) to run the kernel.

cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()

My question is: since OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?

C. Wang

1 Answer


This is not the way to OpenCL (yes, that's what I meant :).

A kernel function operates on operands passed in from the host (CPU), so you would allocate your array on the host using clCreateBuffer and pass it to the kernel using clSetKernelArg. Your kernel does not declare or allocate the device memory; it simply receives it as a __global argument. The buffer of 1000 ints then exists only once, and when you run the kernel using clEnqueueNDRangeKernel with a global size of 1000, the OpenCL implementation runs one work-item (thread) on each of those ints.
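To make that execution model concrete, here is a minimal CPU-side emulation in plain C++. It is not the OpenCL API: the hypothetical `run_ndrange` helper only stands in for clEnqueueNDRangeKernel, the `std::vector` stands in for a buffer created with clCreateBuffer, and `test_kernel` and `launch_example` are names invented for this sketch. The point it illustrates is that the array is allocated once and each work-item touches one element of it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a kernel body: it receives the buffer (the __global argument)
// and the work-item's global ID, and touches exactly one element.
void test_kernel(int* arr, std::size_t gid) {
    arr[gid] = 0;
}

// Hypothetical stand-in for clEnqueueNDRangeKernel: calls the kernel once per
// global ID. A real device would run these calls in parallel, not in a loop.
template <typename Kernel>
void run_ndrange(Kernel kernel, int* buffer, std::size_t global_size) {
    assert(buffer != nullptr || global_size == 0);
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        kernel(buffer, gid);
    }
}

// Allocates one shared buffer up front (as the host would with clCreateBuffer)
// and launches one emulated work-item per element.
std::vector<int> launch_example(std::size_t n) {
    std::vector<int> buffer(n, -1);  // -1 marks "not yet written"
    run_ndrange(test_kernel, buffer.data(), n);
    return buffer;
}
```

In this arrangement the memory cost is n * sizeof(int) in total, no matter how many work-items run concurrently.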

If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right as an upper bound, but it probably won't work. An array declared inside a kernel without an address space qualifier lives in the work-item's private memory (not in the local pool, which is shared by a work-group), and private memory is severely limited (see here on how to check this for your device).
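The arithmetic behind that estimate can be sketched in a few lines of plain C++. The function name `peak_private_bytes` and its parameters are assumptions made for illustration; how many work-items are actually resident at once depends on the device and the implementation, so that number is left as an input.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Upper bound on private memory, in bytes, if `resident_items` work-items are
// alive at the same time and each one holds `int arr[elems_per_item]`.
std::size_t peak_private_bytes(std::size_t resident_items, std::size_t elems_per_item) {
    const std::size_t per_item = elems_per_item * sizeof(int);
    // Guard against overflow in the multiplication below.
    assert(per_item == 0 ||
           resident_items <= std::numeric_limits<std::size_t>::max() / per_item);
    return resident_items * per_item;
}
```

Passing N as `resident_items` gives the theoretical worst case from the question, N * 1000 * sizeof(int). Assuming a 4-byte int, a launch with N = 1,000,000 would need about 4 GB if every work-item were resident simultaneously, which shows why a large per-work-item array is unlikely to fit.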

Ani
  • I think I meant the latter, i.e. allocating 1000 ints in each thread, because the array is a private variable of each thread, right? So I reckon the memory cost will be at most N * 1000 * sizeof(int), theoretically. – C. Wang May 13 '16 at 17:44
  • It won't be N * 1000 * sizeof(int); it will be M * 1000 * sizeof(int), where M is the local workgroup size, which was left unspecified, so it is up to the runtime. – Dithermaster May 13 '16 at 19:31
  • @Dithermaster Yes, you're technically right. But if you look at the OpenCL spec, it's possible for more than one workgroup to run concurrently. That number depends on the memory per workgroup and on the implementation. It's also possible that the `clEnqueueNDRangeKernel` call fails entirely with `CL_OUT_OF_RESOURCES`. I always recommend that people compute the memory needed for the overall kernel run instead of relying on the specifics of a device or implementation. – Ani May 13 '16 at 19:36