
I'm using the Tesla M1060 for GPGPU computation. It has the following specs:

# of Tesla GPUs: 1
# of Streaming Processor Cores (XXX per processor): 240
Memory Interface: 512-bit per GPU

When I use OpenCL, I can query the following device information:

available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30 
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
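
For reference, these values can be queried with clGetDeviceInfo. A minimal sketch of the queries (error checking omitted; `device` is assumed to be a valid cl_device_id):

#include <CL/cl.h>
#include <stdio.h>

void print_device_limits(cl_device_id device)
{
    cl_uint units;
    size_t sizes[3];
    cl_ulong global_mem, local_mem;

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(sizes), sizes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);

    printf("max compute units: %u\n", units);
    printf("max work item sizes: %zu %zu %zu\n", sizes[0], sizes[1], sizes[2]);
    printf("global mem: %llu bytes, local mem: %llu bytes\n",
           (unsigned long long)global_mem, (unsigned long long)local_mem);
}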

How can I relate the GPU card information to the OpenCL memory information?

For example:

  • What does "Memory Interface" mean? Is it linked to a Work Item?
  • How can I relate the 240 cores of the GPU to Work Groups/Items?
  • How should I map Work Groups onto the hardware (i.e. how many Work Groups should I use)?

Thanks

EDIT:

After reading the answers below, one thing is still unclear to me:

The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.

However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.

In the OpenCL 1.1 API specification, it is written (p. 15):

Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit

It seems that either something is inconsistent here, or that I haven't fully understood the difference between Work Groups and Compute Units.

As stated in the comments, when I set BLOCK_SIZE to 32, the program fails with the following error:

Entry function uses too much shared data (0x4020 bytes, 0x4000 max).

The value 16 works.

Addendum

Here is my kernel signature:

// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
    #error "IEEE-754 double precision not supported by OpenCL implementation."
#endif

#define BLOCK_SIZE 16 // --> this is what defines the WG size to me

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
  void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
  __local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
  __local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
  // stuff that does matrix multiplication with __local
}
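
The elided part is the usual tiled multiplication loop. For readers, here is a rough sketch of what such a body typically looks like (illustrative only, assuming q is a multiple of BLOCK_SIZE; not necessarily my exact code):

// illustrative tiled multiplication body (assumes q % BLOCK_SIZE == 0)
int tx = get_local_id(0);
int ty = get_local_id(1);
int col = get_global_id(0);
int row = get_global_id(1);
double acc = 0.0;
for (unsigned int t = 0; t < q / BLOCK_SIZE; ++t) {
    // each work item copies one element of each tile into local memory
    A_sub[ty][tx] = A[row * q + t * BLOCK_SIZE + tx];
    B_sub[ty][tx] = B[(t * BLOCK_SIZE + ty) * q + col];
    barrier(CLK_LOCAL_MEM_FENCE);   // wait for the whole tile to be loaded
    for (int k = 0; k < BLOCK_SIZE; ++k)
        acc += A_sub[ty][k] * B_sub[k][tx];
    barrier(CLK_LOCAL_MEM_FENCE);   // don't overwrite tiles still in use
}
C[row * q + col] = acc;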

In the host code:

#define BLOCK_SIZE 16 
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
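
One thing worth noting: when a local size is specified, OpenCL 1.1 requires global_work_size to be an exact multiple of local_work_size in each dimension. A sketch of how global_work_size might be derived (an assumption on my part; it presumes the kernel either guards against out-of-range indices or that q is already a multiple of BLOCK_SIZE):

/* round the matrix dimension q up to a multiple of BLOCK_SIZE */
const size_t rounded = ((q + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
const size_t global_work_size[2] = {rounded, rounded};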
Carmellose
  • can you post your kernel code please? at least the parts defining the local memory amount. if this is dynamic, post the clEnqueueNDRangeKernel and clSetKernelArg calls from your host code please. – mfa Nov 26 '14 at 14:17
  • also, by "set number of groups to 32", do you mean that you double the global size, or are you halving the work group size to do this? – mfa Nov 26 '14 at 16:33
  • thanks for adding the code. what variables are local in the kernel? it seems that you are allocating 2*(BLOCK_SIZE^2)*8 bytes for doubles, and another 32 bytes beyond that. it's the extra 32 bytes that put you over the limit for your device. – mfa Nov 26 '14 at 17:56
  • Actually, there are 2 BLOCK_SIZE*BLOCK_SIZE arrays. I edited my message. – Carmellose Nov 26 '14 at 18:19

2 Answers


The memory interface doesn't mean anything to an OpenCL application. It is the width, in bits, of the bus the memory controller uses to read from and write to device memory (the GDDR5 chips in modern GPUs). Peak global memory bandwidth is approximately interfaceWidth * effectiveMemoryClock (divide by 8 for bytes per second), but since OpenCL is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound on memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
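
As a rough worked example (assuming the M1060's published 800 MHz GDDR3 memory clock; GDDR transfers twice per clock):

/* peak bandwidth ~= (interface width / 8) * effective transfer rate */
double bytes_per_transfer = 512.0 / 8.0;   /* 64 bytes per transfer */
double transfers_per_sec  = 800e6 * 2.0;   /* 1.6 GT/s effective    */
double peak = bytes_per_transfer * transfers_per_sec;
/* = 64 * 1.6e9 = 102.4e9 bytes/s, i.e. ~102 GB/s theoretical peak */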

The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- e.g. private memory and local memory.

The 240 figure doesn't matter much to OpenCL either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores on this GPU architecture (240/30 = 8). If you query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. See: clGetKernelWorkGroupInfo
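
A sketch of that query (assuming `kernel` has been built and `device` is the target device):

size_t preferred_multiple;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);
/* the asker reports 32 here, which is the hardware's warp size */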

I have answered similar questions about work-group sizing; see here and here.

Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work-group sizes and eventually hard-code the optimal size.

mfa
  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 32. Isn't that strange, given that max compute units is 30? – Carmellose Nov 21 '14 at 16:00
  • The compute units are how many work groups may run in parallel. Each work group is composed of 8 stream cores (8*30 = 240). And probably, each stream core can operate on up to 32 work items; that's why you are getting 32 as the preferred multiple. Also, 32*8 = 512, the maximum number of work items you can use. But please, don't bother with these numbers; just query the CL values and use them (HW independent). – DarkZeros Nov 21 '14 at 16:16
  • More than 30 work groups may run in parallel, since more than one work group may be resident on a compute unit, provided that it has sufficient resources available. You would typically need more than 30 work groups, and many more than 240 work items total in all work groups, to fully utilize the device. The preferred work group size multiple of 32 is due to the way the hardware schedules work-item execution. I suggest you study CUDA documentation to understand the details. Similar questions arise regularly on SO. @DarkZeros - your comment is quite misleading by the way. – void_ptr Nov 22 '14 at 02:20
  • @carmellose - you may want to refer to this question/answers: http://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores. Note this is written in CUDA terminology; however, since your question is more about NVIDIA hardware than it is about OpenCL, CUDA-related answers and documentation address it much more directly. For instance, 32 is the warp size, which is a very well known CUDA concept. – void_ptr Nov 22 '14 at 02:28
  • @void_ptr Why is the comment misleading? He has 240 cores and 30 compute units, so the structure is clearly 8 cores per compute unit. Then, as you said, each core can handle a 32-warp (or 32 work items), so the workgroup max size is 512. – DarkZeros Nov 24 '14 at 10:43
  • @DarkZeros - Each core cannot "handle" 32 work items. Every clock cycle, one instruction from one work item can get scheduled on a core, so a warp of 32 threads (work items) takes 4 clock cycles to get scheduled on the 8 cores of a compute unit on devices of compute capability 1.x. Maximum x-dimension of a thread block being equal to 512 has nothing to do with 8 cores handling 32 threads each, which they do not. If you want to understand this, there is really no way around studying NVIDIA documentation, rather than taking guesses at the numbers. And this subject has been beaten to death on SO. – void_ptr Nov 24 '14 at 16:08
  • When I set the work-group size to 32, the program fails with the following error: `Entry function uses too much shared data (0x4020 bytes, 0x4000 max)`. – Carmellose Nov 26 '14 at 11:14
  • I'm updating my question – Carmellose Nov 26 '14 at 13:18
  • You would typically want to keep the number of threads (work items) in a block (work group) a multiple of 32 and in a reasonable range (128-256 is a good starting point), and the number of blocks in a grid large enough, at least a few times more than your number of compute units. Using too much shared memory (per block, or local memory, in OpenCL terms) will reduce your achievable occupancy though, and this may impact the performance. You may want to investigate a way to lower the shared memory requirements of your kernel as a measure to improve occupancy and performance. – void_ptr Nov 26 '14 at 20:34

Adding another answer to address your local memory issue.

Entry function uses too much shared data (0x4020 bytes, 0x4000 max)

Since you are allocating A_sub and B_sub, each of size 32*32*sizeof(double) when BLOCK_SIZE is 32, you run out of local memory. The device should allow you to allocate 16 KB (0x4000 bytes) of local memory without an issue.

0x4020 is 32 bytes, or 4 doubles, more than your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug in your device or drivers preventing you from allocating the full 16 KB, or 2) you are allocating the memory somewhere else in your kernel.
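
Spelling out the arithmetic (the extra 32 bytes are likely the kernel arguments and built-in variables, which NVIDIA GPUs of compute capability 1.x pass through shared memory):

/* Local memory used by the two tiles: 2 * BLOCK_SIZE^2 * sizeof(double)
 *   BLOCK_SIZE = 16: 2 * 256  * 8 =  4096 bytes (0x1000) -- fits easily
 *   BLOCK_SIZE = 24: 2 * 576  * 8 =  9216 bytes (0x2400) -- still fits
 *   BLOCK_SIZE = 32: 2 * 1024 * 8 = 16384 bytes (0x4000) -- at the limit;
 *   ~32 bytes of argument/built-in overhead then push it to 0x4020 */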

You will have to use a BLOCK_SIZE value less than 32 to work around this for now.

There's good news though. If you only want the work-group size to be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, BLOCK_SIZE=16 already does this for you (16*16 = 256 = 32*8). To make better use of local memory, try BLOCK_SIZE=24 (24*24 = 576 = 32*18).

mfa
  • Thanks! I'm selecting your first answer as the accepted one, but this one gives me a lot of information as well. Cheers :) – Carmellose Nov 27 '14 at 09:44