2

I'm pretty new to the whole OpenCL world, and I have created two pretty simple Kernels and I am trying to chain them togehter, but I am getting rather spurious results. When ran individually, they work as expected, but when slapped together, that is when I am seeing the strange results.

So, Each Kernel individually look like this

Vector 3 Noise

__kernel void addVector3Noise( __global struct State* states, __global float3* randomVector3Values){
    int stateNum = get_global_id(0);
    struct State state = states[stateNum];
    float3 randomVal = randomVector3Values[stateNum];
    struct State newState;
    newState.Vec3 = (float3)(state.Vec3.x + randomVal.x,state.Vec3.y + randomVal.y,state.Vec3.z + randomVal.z); 
    newState.Vec4 = state.Vec4;
    states[stateNum] = newState;
}

For testing this, all of the states had Vec3 of [ 1.0f, 1.0f, 1.0f] with all of the random values being the same, so the output I get is an array of states with the values [2.0f, 2.0f, 2.0f] as I would Expect.

Vector 4 Noise

__kernel void addVector4Noise(__global struct State* states,
__global float3* randomVector4Values){
    int stateNum = get_global_id(0);
    struct State state = states[stateNum];
    float3 randomVal = randomVector4Values[stateNum];
    float4 newVector4 = randomQuaternionRotation(state.Vector4, randomVal);
    struct State newState;
    newState.Vector3 = state.Vector3;
    newState.Vector4= newVector4;
    states[stateNum] = newState;    
}

Running this with very simple test data also gives me what I want.

Now the issue comes in when chaining them together. I invoked them in the order Vector 4 noise -> Vector 3 noise. Now when running the Vector 4 noise Kernel, I am seeing the vector 3 values change, and the change seems to follow a pattern.

So, after the vector 4 kernel has been ran, I would expect the vector 3 in each state to be the same as when it was plugged in. So that would mean every state would have a Vector 3 value of [1.0f,1.0f,1.0f] The following is what I am actually seeing the vector 3's come out as:

[1.0,1.0,1.0] 
[0.576367259,1.0,1.0]
[0.999199867,0.6448302,1.0]
[1.313311, 1.067663, 0.3307195]
[-0.08005857, 1.067663, 1.450237]
[1, 0.2340522, 1.136126]
[1, 1, 0.3025152]
[1, 1, 1]

And that pattern repeats itself throughout all of the Vector 3 Values. Note that in the Kernel, it is just copying the Vector3 from the previous state into the new state.

This is how I have them chained together using OpenCL.Net

    using (var env = "*".CreateCLEnvironment(DeviceType.Gpu))
    {           
        var source = LoadProgram("kernels.cl");
        var context = env.Context;

        ErrorCode errorCode;
        var program = Cl.CreateProgramWithSource(context, 1u, source, null, out errorCode);
        CheckSuccess(errorCode);
        errorCode = Cl.BuildProgram(program, (uint)env.Devices.Length, env.Devices, "-cl-opt-disable", null,
            IntPtr.Zero);
        if (errorCode != ErrorCode.Success)
        {
            var info = Cl.GetProgramBuildInfo(program, env.Devices[0], ProgramBuildInfo.Log, out errorCode).ToString();
            throw new Exception(info);
        }

        var kernels = Cl.CreateKernelsInProgram(program, out errorCode);
        CheckSuccess(errorCode);
        var Vector4NoiseKernel = kernels[0];
        var Vector3NoiseKernel = kernels[1];

        var rnd = new Random();
        var states = Enumerable.Range(1, ArrayLength)
            .Select(_ => new State
            {
                Vector3 = new Vector3(1, 1, 1),
                Vector4 = new Vector4(0.5f,0.5f,0.5f,0.5f)
            })
            .ToArray();
        var randomVector4Values = Enumerable.Range(1, ArrayLength)
            .Select(_ => new Vector3(2f, 2f, 2f))
            .ToArray();

        var randomVector3Values = Enumerable.Range(1, ArrayLength)
            .Select(_ => new Vector3(1f, 1f, 1f))
            .ToArray();

        var vector4StatesBuffer = context.CreateBuffer(states, MemFlags.ReadWrite);
        var randomVector4ValuesBuffer = context.CreateBuffer(randomVector4Values, MemFlags.ReadOnly);

        Event ev;

        Cl.SetKernelArg(vector4NoiseKernel, 0, vector4StatesBuffer);
        Cl.SetKernelArg(vector4NoiseKernel, 1, randomVector4ValuesBuffer);

        errorCode = Cl.EnqueueNDRangeKernel(env.CommandQueues[0], vector4NoiseKernel, 1, null
            , new[] { new IntPtr(ArrayLength) }, new[] { new IntPtr(1) }, 0u, null, out ev);
        errorCode.Check();

        env.CommandQueues[0].ReadFromBuffer(vector4StatesBuffer, states, waitFor: ev);

        var randomVector3ValuesBuffer = context.CreateBuffer(randomVector3Values, MemFlags.ReadOnly);
        var vector3StatesBuffer = context.CreateBuffer(states, MemFlags.ReadWrite);


        Cl.SetKernelArg(vector3NoiseKernel, 0, vector3StatesBuffer);
        Cl.SetKernelArg(vector3NoiseKernel, 1, randomVector3ValuesBuffer);

        errorCode = Cl.EnqueueNDRangeKernel(env.CommandQueues[0], vector3NoiseKernel, 1, null
            , new[] { new IntPtr(ArrayLength) }, new[] { new IntPtr(1) }, 0u, null, out ev);
        errorCode.Check();

        Cl.Finish(env.CommandQueues[0]).Check();
        env.CommandQueues[0].ReadFromBuffer(vector3StatesBuffer, states, waitFor: ev);
    }

Excuse the huge dollop of code there, but this is a playground project, and I am pretty much just vomiting ideas out, so tidyness and elegance are not an issue here :)

Thanks in advance for any help you may be able to provide.

EDIT So the first thing I have done this morning is pull each kernel out into its own cl file and make sure each has its own version of state with only what is required for it (Vector4 and Vector3 respectively), along with a new using statement with all the gubbins that goes along with it for the newly separated out Vector3 noise kernel. To my joy, the Vector4 noise kernel did exactly as I expected it to do, however, when it came to the Vector3 noise, a similar issue to previously occured. Still passing in [1.0f,1.0f,1.0f] as both the random values and the starting Vector3 values, and it is still not producing the output I expect. The pattern that repeats this time is:

[2.0f,2.0f,2.0f]
[1.0f,2.0f,2.0f]
[2.0f,1.0f,2.0f]
[2.0f,2.0f,1.0f]
[2.0f,2.0f,2.0f]
David Watts
  • 2,249
  • 22
  • 33
  • Typical question: Are you checking there is no error in the execution? Maybe it is not even running at all. – DarkZeros Oct 06 '15 at 10:45
  • @DarkZeros It is definitely running. If it wasn't surely I wouldn't see the values change at all. It is only ever one element of each array in the output that is incorrect. – David Watts Oct 06 '15 at 10:46

2 Answers2

3

In OpenCL, the 3-component vector types occupy the same size as the 4-component vector types. For example, a float3 is defined to be 16-bytes, not 12 bytes. If the data-structures you use on the host (the Vector3 class in this case) are not the same size, you will likely run into issues.

The pattern in the output in your edited post is three 2.0s and then 1.0, which indicates that this is likely the cause of the behaviour you are seeing.

One solution would be to use a Vector4 on the host side in place of Vector3.

jprice
  • 9,755
  • 1
  • 28
  • 32
  • Will look into this. Thanks – David Watts Oct 06 '15 at 12:48
  • Ok, so using a vector 4 with the last element in both as NaN, everything seems to be working ok. That presents and issue though in how I deal with there being this needless NaN on the end of every vector. Trying to find some information on how Marshalling works in OpenCL.net so I can pack a bunch of Vector3's as vector4's so I can feed them in and read out without having the 'pointless' NaN – David Watts Oct 06 '15 at 12:52
  • jprice is right, Just change ALL to vector 4, including the host side, otherwise you are going to not fully initialize that data in the GPU (due to the vector length being shorter) – DarkZeros Oct 06 '15 at 13:19
  • I've managed to make it work using the vector 3's with some marshalling gubbins now I have been enlightened to this size difference. Thank you bothe very much for your help – David Watts Oct 06 '15 at 13:33
  • @jprice do you have a reference table for type sizes on the OpenCL side. I'm working my way through the khronos documentation, but I haven't come across anything yet – David Watts Oct 06 '15 at 13:54
  • @DavidWatts No table, but in the OpenCL 1.2 specification section 6.1.5 "Alignment of Types" should give you all the information you need. – jprice Oct 06 '15 at 14:01
1

So, after @jprice pointed out the size difference between the C#and OpenCL types, I managed to solve this issue by explicitly declaring the total size of my struct on the C# side to be in line with what is expected on the OpenCL side, so now my struct looks like this.

[StructLayout(LayoutKind.Sequential, Size = 32)]
public struct State
{
    public Vector3 Vector3;
    public Vector4 Vector4;
}

As the float3 and float4 on the OpenCL side are both 16bytes, allocating 32 bytes to my struct which contains both has lead to the right behaviour.

David Watts
  • 2,249
  • 22
  • 33