Ideas for CUDA kernel calls with parameters exceeding 256 bytes

Question

I have a couple of structures whose combined size exceeds the 256-byte limit on parameters passed in a kernel call.

Both structures are already allocated and copied to device global memory.

1) How can I use these structures in the same kernel without passing them as parameters?

More details: separately, each structure can be passed as a parameter, for example to different kernels. But:

2) How can I use both structures in the same kernel?

If the structures are already allocated and copied to device global memory, you should just be able to pass a pointer to those areas. There should be no need or reason to pass these by value. — Robert Crovella, Feb 19 '14 at 23:53
@RobertCrovella Thanks. I was not sure whether I could pass a pointer from the CPU-side kernel call, or how to point to those areas. Compute capability 1.3 has so many limitations! – mrei, Feb 20 '14 at 16:18

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

If your data structures are already in global memory, then you can just pass a pointer in as the kernel argument.

On a related note, the limit for kernel arguments is 4 KB for devices of compute capability 2.x and higher:

__global__ function parameters are passed to the device:

via shared memory and are limited to 256 bytes on devices of compute capability 1.x,

via constant memory and are limited to 4 KB on devices of compute capability 2.x and higher.

__device__ and __global__ functions cannot have a variable number of arguments.

(cf. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters)
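To make the arithmetic concrete, here is a minimal host-side C sketch. The structs ParamsA and ParamsB, the constant CC1X_PARAM_LIMIT, and both helper functions are hypothetical names invented for illustration; the real structures from the question are not shown in the thread. It simply adds up argument sizes and ignores any alignment padding the compiler may insert between kernel arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structs: each one is under 256 bytes, but together they are not. */
typedef struct { double values[20];  int id;    } ParamsA;
typedef struct { double weights[16]; int count; } ParamsB;

/* Parameter limit quoted above for compute capability 1.x devices. */
enum { CC1X_PARAM_LIMIT = 256 };

/* Sum the sizes of all arguments (padding between arguments is ignored). */
static size_t total_param_bytes(const size_t *sizes, size_t n)
{
    assert(sizes != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += sizes[i];
    return total;
}

/* Returns 1 if the summed argument sizes fit within the 1.x limit, else 0. */
static int fits_cc1x_limit(const size_t *sizes, size_t n)
{
    return total_param_bytes(sizes, n) <= (size_t)CC1X_PARAM_LIMIT;
}
```

Calling fits_cc1x_limit with { sizeof(ParamsA), sizeof(ParamsB) } reports that the by-value arguments do not fit, while { sizeof(ParamsA *), sizeof(ParamsB *) } fits easily, since each pointer is only a few bytes regardless of how large the structure it points to is.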

answered Feb 20 '14 at 10:54

Tom


Thanks. I read that part in the documentation. At the moment, though, I am bound to compute capability 1.3 (Tesla C1060). Hopefully I will be upgrading soon (ideally before I finish my thesis). – mrei Feb 20 '14 at 16:20

score 2 · Accepted Answer · edited Feb 21 '14 at 12:50

As Robert Crovella suggested in his comment, you should just be able to pass a pointer to those areas. I had a similar problem in OpenCL. This is how I implemented the structs:

(My kernel and host functions are in OpenCL, so the syntax will differ for you, but the concept is the same.)

The following two structs are defined in my host code, 'Mapper.c':

typedef struct data
{
  double dattr[10];
  int d_id;
  int bestCent;
}Data;


typedef struct cent
{
  double cattr[5];
  int c_id;
}Cent;

Data *dataNode;
Cent *centNode;
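Because these structs are copied byte for byte into device memory, the host compiler and the device compiler must agree on their layout. The following is a minimal sketch of a host-side check. The helper names host_layout_matches_device and check_layout_or_abort are hypothetical, and the expected numbers assume a device where double is 8 bytes with 8-byte alignment and int is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct data
{
  double dattr[10];
  int d_id;
  int bestCent;
}Data;

typedef struct cent
{
  double cattr[5];
  int c_id;
}Cent;

/* Hypothetical helper: returns 1 when the host layout matches the layout
   assumed for the device (8-byte aligned doubles, 4-byte ints). */
static int host_layout_matches_device(void)
{
    return sizeof(double) == 8 && sizeof(int) == 4
        && offsetof(Data, d_id) == 80      /* after 10 doubles */
        && offsetof(Data, bestCent) == 84
        && sizeof(Data) == 88
        && offsetof(Cent, c_id) == 40      /* after 5 doubles */
        && sizeof(Cent) == 48;             /* 44 bytes padded up to a multiple of 8 */
}

/* Hypothetical helper: abort early instead of copying mismatched data. */
static void check_layout_or_abort(void)
{
    assert(host_layout_matches_device() && "host/device struct layout differs");
}
```

Note the 4 bytes of tail padding in Cent: an array of Cent has a 48-byte stride, so a device-side definition that packed the struct to 44 bytes would read every element after the first from the wrong offset.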

After allocating memory in the device's global memory, I transferred the data. I had to repeat the struct definitions in the kernel source file, as below:

mapper.cl:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef struct data
{
  double dattr[10];
  int d_id;
  int bestCent;
}Data;


typedef struct cent
{
  double cattr[5];
  int c_id;
}Cent;

__kernel void mapper(__global int *keyMobj, __global int *valueMobj,
                     __global Data *dataMobj, __global Cent *centMobj)
{
    int i = get_global_id(0);       // one work-item per data point
    int j, k, color = 0;
    double dmin = 1000000.0, dx;

    for (j = 0; j < 2; j++)         // here 2 is number of centroids considered
    {
        dx = 0.0;
        for (k = 0; k < 2; k++)     // squared distance over the first 2 attributes
        {
            double diff = centMobj[j].cattr[k] - dataMobj[i].dattr[k];
            dx += diff * diff;
        }
        if (dx < dmin)              // keep the nearest centroid found so far
        {
            color = j;
            dmin = dx;
        }
    }
    keyMobj[i] = color;             // index of the nearest centroid
    valueMobj[i] = dataMobj[i].d_id;
}

You can see that the kernel receives only pointers to the global-memory buffers: dataMobj and centMobj for the structs, plus keyMobj and valueMobj for the outputs.

kernel = clCreateKernel(program, "mapper", &ret);
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&keyMobj);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&valueMobj);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&dataMobj);
ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&centMobj);

The lines above belong to the host-side code (mapper.c): clCreateKernel creates the kernel object for the mapper function defined in mapper.cl, and the four clSetKernelArg calls pass the buffer arguments to that kernel.
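One way to check that the buffers were passed and read correctly is to compare the kernel's output against a plain CPU version of the same loop. The following is a minimal sketch of such a reference. The function name mapper_reference and its parameters n_data, n_cents and n_dims are hypothetical; the kernel above hard-codes 2 centroids and 2 attributes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct data
{
  double dattr[10];
  int d_id;
  int bestCent;
}Data;

typedef struct cent
{
  double cattr[5];
  int c_id;
}Cent;

/* CPU reference for the mapper kernel: for each data point, store the index
   of the nearest centroid (squared Euclidean distance over the first n_dims
   attributes) in keys[i] and the point's id in values[i]. */
static void mapper_reference(const Data *data, size_t n_data,
                             const Cent *cents, int n_cents, int n_dims,
                             int *keys, int *values)
{
    assert(data != NULL && cents != NULL && keys != NULL && values != NULL);
    assert(n_dims <= 5);   /* cattr holds only 5 attributes */

    for (size_t i = 0; i < n_data; ++i) {
        int color = 0;
        double dmin = 1000000.0;   /* same initial bound as the kernel */
        for (int j = 0; j < n_cents; ++j) {
            double dx = 0.0;
            for (int k = 0; k < n_dims; ++k) {
                double diff = cents[j].cattr[k] - data[i].dattr[k];
                dx += diff * diff;
            }
            if (dx < dmin) {
                color = j;
                dmin = dx;
            }
        }
        keys[i] = color;
        values[i] = data[i].d_id;
    }
}
```

After reading keyMobj and valueMobj back from the device, the host can run mapper_reference on its own copies of the data with n_cents = 2 and n_dims = 2 and compare the two results element by element.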

Thanks! I get it. I will try to implement in CUDA code and come back with more inputs and probably more questions. — mrei, Feb 20 '14 at 16:16
