
I have found the following parameters, which seem to restrict the work items for a device based on the device memory:

  • CL_DEVICE_GLOBAL_MEM_SIZE
  • CL_DEVICE_LOCAL_MEM_SIZE
  • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
  • CL_DEVICE_MAX_MEM_ALLOC_SIZE
  • CL_DEVICE_MAX_WORK_GROUP_SIZE
  • CL_DEVICE_MAX_WORK_ITEM_SIZES
  • CL_KERNEL_WORK_GROUP_SIZE

I find the documentation for these parameters insufficient, and hence I am not able to use them properly. Can somebody please tell me what these parameters mean and how they are used? Is it necessary to check all of them?

PS: I have a rough understanding of some of the parameters, but I am not sure whether it is correct.

Cool_Coder

1 Answer


CL_DEVICE_GLOBAL_MEM_SIZE:

  • The amount of global memory on the device, in bytes. You typically don't need to care about it unless you use a large amount of data; in any case, the implementation will report a CL_OUT_OF_RESOURCES error if you try to use more than is allowed.
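
For reference, a minimal sketch of how this value can be queried on the host with `clGetDeviceInfo` (it assumes you already hold a valid `cl_device_id`):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query the total global memory of the device, in bytes. */
void print_global_mem_size(cl_device_id device)
{
    cl_ulong global_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    printf("Global memory: %llu bytes\n", (unsigned long long)global_mem);
}
```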

CL_DEVICE_LOCAL_MEM_SIZE:

  • The amount of local memory available to each work group, in bytes. However, this limit only holds under ideal conditions: if your kernel uses a large number of work items per work group, some of the private per-work-item data may get spilled into local memory. So treat it as the maximum amount available per work group.
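
As a sketch of how you might use this limit, the helper below clamps the size of a dynamically sized `__local` argument to the advertised maximum (the kernel and the argument index 2 are hypothetical placeholders):

```c
#include <CL/cl.h>

/* Clamp a __local scratch buffer to the device's local memory limit.
   The argument index (2) is only a placeholder for this sketch. */
cl_int set_local_scratch(cl_device_id device, cl_kernel kernel, size_t wanted_bytes)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    if ((cl_ulong)wanted_bytes > local_mem)
        wanted_bytes = (size_t)local_mem;   /* stay within the advertised maximum */
    /* For __local kernel arguments you pass a size and a NULL pointer. */
    return clSetKernelArg(kernel, 2, wanted_bytes, NULL);
}
```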

CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:

  • The maximum amount of constant memory that can be used for a single kernel, in bytes. If the constant buffers you use together exceed this amount, the kernel will either fail or fall back to normal global memory (and may therefore be slower).
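
A small, hedged check along these lines (the total size of your `__constant` arguments is something you have to add up yourself):

```c
#include <CL/cl.h>

/* Return 1 if the combined size of all __constant kernel arguments
   fits into the device's constant buffer, 0 otherwise. */
int constant_args_fit(cl_device_id device, size_t total_constant_bytes)
{
    cl_ulong max_const = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const), &max_const, NULL);
    return (cl_ulong)total_constant_bytes <= max_const;
}
```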

CL_DEVICE_MAX_MEM_ALLOC_SIZE:

  • The maximum amount of memory you can allocate on the device in one single piece (one buffer), in bytes.
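
A possible way to respect this limit before creating a buffer (just a sketch; a real application might instead split the data across several buffers):

```c
#include <CL/cl.h>

/* Create a buffer only if it fits into a single allocation on the device. */
cl_mem create_checked_buffer(cl_context ctx, cl_device_id device,
                             size_t bytes, cl_int *err)
{
    cl_ulong max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    if ((cl_ulong)bytes > max_alloc) {
        *err = CL_INVALID_BUFFER_SIZE;      /* too big for one allocation */
        return NULL;
    }
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, err);
}
```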

CL_DEVICE_MAX_WORK_GROUP_SIZE:

  • The maximum work-group size of the device. This is the ideal maximum; depending on the kernel code, the actual limit may be lower.
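
Queried the same way as the other device properties; note that this one is a `size_t`, not a `cl_ulong`:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query the device-wide upper bound on the work-group size. */
void print_max_wg_size(cl_device_id device)
{
    size_t max_wg = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);
    printf("Max work-group size: %zu\n", max_wg);
}
```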

CL_DEVICE_MAX_WORK_ITEM_SIZES:

  • The maximum number of work items per dimension. For example, the device may report a maximum work-group size of 1024 and 3 dimensions, but you may not be able to use (1024,1,1) as the local size, because the per-dimension limits may be (64,64,64); in that case you could only use, for example, (64,2,8).
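
This parameter returns an array with one entry per dimension, so a sketch of querying it looks slightly different (the fixed-size array of 8 is just a convenient upper bound for this example; devices typically report 3 dimensions):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Print the per-dimension work-item limits of the device. */
void print_work_item_sizes(cl_device_id device)
{
    cl_uint dims = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                    sizeof(dims), &dims, NULL);

    size_t sizes[8] = {0};   /* assumes dims <= 8; it is usually 3 */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(sizes), sizes, NULL);

    for (cl_uint d = 0; d < dims; ++d)
        printf("dimension %u: at most %zu work items\n", d, sizes[d]);
}
```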

CL_KERNEL_WORK_GROUP_SIZE:

  • The work-group size the implementation reports for this particular kernel (queried with clGetKernelWorkGroupInfo): the largest size this specific kernel can be launched with on the device. You may choose a smaller size, but the reported value is usually already a good one (a good trade-off of GPU usage, memory spilling, etc.).
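
Unlike the previous properties, this one is queried per kernel with `clGetKernelWorkGroupInfo` (a compiled `cl_kernel` is assumed here):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query the work-group size limit reported for one specific kernel. */
void print_kernel_wg_size(cl_kernel kernel, cl_device_id device)
{
    size_t kernel_wg = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);
    printf("Work-group size for this kernel: %zu\n", kernel_wg);
}
```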

NOTE: All these values are theoretical limits. If your kernel uses one resource more heavily than another (for example, local memory whose usage grows with the work-group size), you may not be able to reach the maximum number of work items per work group, because you may hit the local memory limit first.
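
To make the note concrete, here is a hedged sketch that picks a 1D local size from the kernel-specific limit rather than the device-wide one, and also shows how much local memory the kernel already consumes per work group (via CL_KERNEL_LOCAL_MEM_SIZE):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Pick a 1D work-group size no larger than what this kernel can run with,
   and report the local memory it uses per work group. */
size_t pick_local_size(cl_kernel kernel, cl_device_id device, size_t wanted)
{
    size_t kernel_wg = 0;
    cl_ulong kernel_local = 0, device_local = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(kernel_local), &kernel_local, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(device_local), &device_local, NULL);

    printf("kernel uses %llu of %llu bytes of local memory per work group\n",
           (unsigned long long)kernel_local, (unsigned long long)device_local);

    return (wanted < kernel_wg) ? wanted : kernel_wg;
}
```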

DarkZeros
  • 1. regarding CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, is this the global constant memory or for private variables in a kernel? – Cool_Coder Apr 11 '14 at 16:10
  • It is the amount of memory used by the kernel arguments passed with the __constant qualifier. They may be normal buffers outside the kernel, but inside, if declared that way, they become very fast read-only memory. You should use this for parameters that will be reused across many work items. Static parameters (integers, structs, etc.) will also end up in some sort of constant memory (depending on the implementation). – DarkZeros Apr 11 '14 at 16:13
  • 2. why is CL_DEVICE_MAX_WORK_GROUP_SIZE required when we already have CL_DEVICE_MAX_WORK_ITEM_SIZES? – Cool_Coder Apr 11 '14 at 16:14
  • CL_DEVICE_MAX_WORK_GROUP_SIZE indicates how big a group may be. CL_DEVICE_MAX_WORK_ITEM_SIZES indicates the limit of how a group can be shaped in the N space of dimensions. (typically 3 or less) – DarkZeros Apr 11 '14 at 16:15
  • 3. Let's say the number of work groups is equal to the max compute units. So if I take CL_KERNEL_WORK_GROUP_SIZE as the local work-group size, does that mean all the required memory checks are done automatically by the OpenCL compiler, so that I need not check these parameters again? – Cool_Coder Apr 11 '14 at 16:19
  • CL_DEVICE_MAX_COMPUTE_UNITS defines how many work groups can run in parallel on the device, but it is just an informative parameter that does not affect any other parameter. A good practice is to check everything; another, lazier approach is to check only when something fails. If you select the default values for the work-group size, you don't need to check anything unless you run out of memory (local/global/constant), which rarely happens. – DarkZeros Apr 11 '14 at 16:30
  • Ideally, how many multiples of CL_DEVICE_MAX_COMPUTE_UNITS work groups should we create? I.e., say CL_DEVICE_MAX_COMPUTE_UNITS = 6; then how many work groups do you create: (6*1), (6*2), (6*3), or ...? – Cool_Coder Apr 11 '14 at 16:43
  • 1
    @Cool_Coder I really don't understand why you want to select how many work groups you want to run in total. What you have to select, is how many work items (global size), and how will you split these global size in small parallel chuncks for the hardware (work group size). If the optimal work group size is 256, and your work is 1024. The hardware will run 4 work groups. But if it is 1M of size, it will run 4096 work groups. Of course the second case will take more time, but in both cases the work group size is the ideal one (256). (typically the one provided by OpenCL (default) is OK) – DarkZeros Apr 11 '14 at 17:28
  • 2
    @Cool_Coder Supose I need to multiply 1G (1073741824) numbers by a scale factor of X. And my hardware does it in the optimal way with a WG size of 1024. Then I will say `global = 1073741824` `local = 1024`. And that is all. The hardware will run 1M work groups, in the N compute units sequentially (if it has only 4, then each cycle only 4 will be processed) until finishing. – DarkZeros Apr 11 '14 at 17:36
  • ok, understood. What I was thinking of was distributing work groups across different enqueueNDRangeKernel calls. Looks like that is pretty inefficient. As you suggested, I will now enqueue all the work groups in a single call. Thanks for your guidance! – Cool_Coder Apr 12 '14 at 06:00
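
To make the comment thread above concrete, a minimal sketch of the single-enqueue pattern DarkZeros describes, using the numbers from his example (the `queue` and `kernel` setup is assumed to exist already):

```c
#include <CL/cl.h>

/* One clEnqueueNDRangeKernel call for 1G work items with a work-group size
   of 1024; the runtime schedules the resulting ~1M work groups across the
   device's compute units by itself. */
cl_int enqueue_scale_kernel(cl_command_queue queue, cl_kernel kernel)
{
    size_t global = 1073741824;   /* total work items (1G) */
    size_t local  = 1024;         /* work-group size */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}
```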