Passing struct with pointer members to OpenCL kernel using PyOpenCL

Question

Let's suppose I have a kernel to compute the element-wise sum of two arrays. Rather than passing a, b, and c as three parameters, I make them structure members as follows:

typedef struct
{
    __global uint *a;
    __global uint *b;
    __global uint *c;
} SumParameters;

__kernel void compute_sum(__global SumParameters *params)
{
    uint id = get_global_id(0);
    params->c[id] = params->a[id] + params->b[id];
    return;
}

There is information on structures if you RTFM of PyOpenCL [1], and others have addressed this question too [2] [3] [4]. But none of the OpenCL struct examples I've been able to find have pointers as members.

Specifically, I'm worried about whether host/device address spaces match, and whether host/device pointer sizes match. Does anyone know the answer?

[1] http://documen.tician.de/pyopencl/howto.html#how-to-use-struct-types-with-pyopencl

[2] Struct Alignment with PyOpenCL

[3] http://enja.org/2011/03/30/adventures-in-opencl-part-3-constant-memory-structs/

[4] http://acooke.org/cute/Somesimple0.html

CaptainObvious · Accepted Answer · 2013-07-28T09:38:23.617

No, there is no guaranty that address spaces match. For the basic types (float, int,…) you have alignment requirement (section 6.1.5 of the standard) and you have to use the cl_type name of the OpenCL implementation (when programming in C, pyopencl does the job under the hood I’d say).

For the pointers it’s even simpler due to this mismatch. The very beginning of section 6.9 of the standard v 1.2 (it’s section 6.8 for version 1.1) states:

Arguments to kernel functions declared in a program that are pointers must be declared with the __global, __constant or __local qualifier.

And in the point p.:

Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union.

Note also the point d.:

Variable length arrays and structures with flexible (or unsized) arrays are not supported.

So, no way to make you kernel runs as described in your question and that's why you haven’t been able to find some examples of OpenCl struct have pointers as members.
I still can propose a workaround that takes advantage of the fact that the kernel is compiled in JIT. It still requires that you pack you data properly and that you pay attention to the alignment and finally that the size doesn’t change during the execution of the program. I honestly would go for a kernel taking 3 buffers as arguments, but anyhow, there it is.

The idea is to use the preprocessor option –D as in the following example in python:

Kernel:

typedef struct {
    uint a[SIZE];
    uint b[SIZE];
    uint c[SIZE];
} SumParameters;

kernel void foo(global SumParameters *params){
    int idx = get_global_id(0);
    params->c[idx] = params->a[idx] + params->b[idx];
}

Host code:

import numpy as np
import pyopencl as cl

def bar():
   mf = cl.mem_flags
   ctx = cl.create_some_context()
   queue = cl.CommandQueue(self.ctx)
   prog_f = open('kernels.cl', 'r')
   #a = (1, 2, 3), b = (4, 5, 6)          
   ary = np.array([(1, 2, 3), (4, 5, 6), (0, 0, 0)], dtype='uint32, uint32, uint32')
   cl_ary = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=ary)
   #Here should compute the size, but hardcoded for the example
   size = 3
   #The important part follows using -D option
   prog = cl.Program(ctx, prog_f.read()).build(options="-D SIZE={0}".format(size))    
   prog.foo(queue, (size,), None, cl_ary)
   result = np.zeros_like(ary)
   cl.enqueue_copy(queue, result, cl_ary).wait()
   print result

And the result:

[(1L, 2L, 3L) (4L, 5L, 6L) (5L, 7L, 9L)]

score 0 · Answer 2 · answered Jul 14 '13 at 02:04

I don't know the answer to my own question, but there are 3 workarounds I can come up with off the top of my head. I consider Workaround 3 the best option.

Workaround 1: We only have 3 parameters here, so we could just make a, b, and c kernel parameters. But I've read there's a limit on the number of parameters you can pass to a kernel, and I personally like to refactor any function that takes more than 3-4 arguments to use structs (or, in Python, tuples or keyword arguments). So this solution makes the code harder to read, and doesn't scale.

Workaround 2: Dump everything in a single giant array. Then the kernel would look like this:

typedef struct
{
    uint ai;
    uint bi;
    uint ci;
} SumParameters;

__kernel void compute_sum(__global SumParameters *params, uint *data)
{
    uint id = get_global_id(0);
    data[params->ci + id] = data[params->ai + id] + data[params->bi + id];
    return;
}

In other words, instead of using pointers, use offsets into a single array. This looks an awful lot like the beginnings of implementing my own memory model, and it feels like it's reinventing a wheel that exists somewhere in PyOpenCL, or OpenCL, or both.

Workaround 3: Make setter kernels. Like this:

__kernel void set_a(__global SumParameters *params, __global uint *a)
{
    params->a = a;
    return;
}

and ditto for set_b, set_c. Then execute these kernels with worksize 1 to set up the data structure. You still need to know how big a block to allocate for params, but if it's too big, nothing bad will happen (except a little wasted memory), so I'd say just assume the pointers are 64 bits.

This workaround's performance is probably awful (I imagine a kernel call has enormous overhead), but fortunately that shouldn't matter too much for my application (my kernel is going to run for seconds at a time, it's not a graphics thing that has to run at 30-60 fps, so I imagine that the time taken by extra kernel calls to set parameters will end up being a tiny fraction of my workload, no matter how high the per-kernel-call overhead is).

I realise it's 6 years later BUT, is workaround 3 valid OpenCL? Specifically, is it valid to access global memory addresses outside of buffers passed as kernel args? I've ported some CUDA code to OpenCL 1.2 using this approach and it works on MOST GPUs (Intel, NVIDIA and ARM) but, on some AMD GPUs, updates made to buffers accessed via pointers don't seem to 'stick' (unless you put a printf in the kernel for extra sketchyness). Theories are: 1) Pointers aren't actual global memory addresses on these devices - no idea why printf would help 2) Updates to buffer are getting optimized out — neworderofjamie, Jul 07 '20 at 14:56

Passing struct with pointer members to OpenCL kernel using PyOpenCL

2 Answers2

Linked