I am teaching myself OpenCL in Java using the JogAmp JOCL libraries. One of my tests is creating a Mandelbrot map. I have four implementations: simple serial, parallel using the Java executor interface, OpenCL for a single device, and OpenCL for multiple devices. The first three are fine, the last one is not. When I compare the (correct) output of the single device solution with the incorrect output of the multiple device solution, I notice that the colors are roughly the same but that the output of the latter is garbled. I think I understand where the problem resides, but I can't solve it.
The trouble lies (imho) in the fact that OpenCL uses vector buffers and that I have to translate the output back into a matrix. I think that this translation is incorrect. I parallelize the code by dividing the Mandelbrot map into rectangles in which the width (xSize) is divided by the number of tasks and the height (ySize) is preserved. I think I am able to transmit that information correctly into the kernel, but translating it back goes wrong.
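To give an idea of the sizes involved, here is the slicing arithmetic with made-up example figures (my real parameters do not matter here); the variable names are the ones used in the code below.

// Example figures only, to illustrate the intended slicing.
int xSize = 1024, ySize = 768;                 // full map in pixels
int deviceCount = 2, tasksPerQueue = 16;       // deviceCount stands for pool.getSize ()
int taskCount  = deviceCount * tasksPerQueue;  // 32 tasks in total
int sliceWidth = xSize / taskCount;            // 1024 / 32 = 32 columns per task
int sliceSize  = sliceWidth * ySize;           // 32 * 768 = 24576 ints per task
// Task i is given the sub-buffer [i * sliceSize, (i + 1) * sliceSize) of the shared output buffer
// and should render the columns [i * sliceWidth, (i + 1) * sliceWidth) of the map.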
CLMultiContext mc = CLMultiContext.create (deviceList);
try
{
   CLSimpleContextFactory factory = CLQueueContextFactory.createSimple (programSource);
   CLCommandQueuePool<CLSimpleQueueContext> pool = CLCommandQueuePool.create (factory, mc);
   IntBuffer dataC = Buffers.newDirectIntBuffer (xSize * ySize);
   IntBuffer subBufferC = null;
   int tasksPerQueue = 16;
   int taskCount = pool.getSize () * tasksPerQueue;
   int sliceWidth = xSize / taskCount;
   int sliceSize = sliceWidth * ySize;
   int bufferSize = sliceSize * taskCount;
   double sliceX = (pXMax - pXMin) / (double) taskCount;
   String kernelName = "Mandelbrot";

   out.println ("sliceSize: " + sliceSize);
   out.println ("sliceWidth: " + sliceWidth);
   out.println ("sS*h:" + sliceWidth * ySize);

   List<CLTestTask> tasks = new ArrayList<CLTestTask> (taskCount);
   for (int i = 0; i < taskCount; i++)
   {
      subBufferC = Buffers.slice (dataC, i * sliceSize, sliceSize);
      tasks.add (new CLTestTask (kernelName, i, sliceWidth, xSize, ySize, maxIterations,
                                 pXMin + i * sliceX, pYMin, xStep, yStep, subBufferC));
   } // for
   pool.invokeAll (tasks);

   // submit blocking immediately
   for (CLTestTask task: tasks) pool.submit (task).get ();

   // Ready; read the buffer into the frequencies matrix.
   // According to me this is the part that goes wrong.
   int w = taskCount * sliceWidth;
   for (int tc = 0; tc < taskCount; tc++)
   {
      int offset = tc * sliceWidth;
      for (int y = 0; y < ySize; y++)
      {
         for (int x = offset; x < offset + sliceWidth; x++)
         {
            frequencies [y][x] = dataC.get (y * w + x);
         } // for
      } // for
   } // for
   pool.release();
The last loop is the culprit, meaning that there is (I think) a mismatch between the kernel encoding and the host translation. The kernel:
kernel void Mandelbrot
(
   const int    width,
   const int    height,
   const int    maxIterations,
   const double x0,
   const double y0,
   const double stepX,
   const double stepY,
   global int  *output
)
{
   unsigned ix = get_global_id (0);
   unsigned iy = get_global_id (1);

   if (ix >= width) return;
   if (iy >= height) return;

   double r = x0 + ix * stepX;
   double i = y0 + iy * stepY;

   double x = 0;
   double y = 0;
   double magnitudeSquared = 0;
   int iteration = 0;
   while (magnitudeSquared < 4 && iteration < maxIterations)
   {
      double x2 = x*x;
      double y2 = y*y;
      y = 2 * x * y + i;
      x = x2 - y2 + r;
      magnitudeSquared = x2 + y2;
      iteration++;
   }
   output [iy * width + ix] = iteration;
}
The last statement encodes the information into the vector. This kernel is used by the single device version as well. The only difference is that in the multi device version I change the width and x0. As you can see in the Java code, I pass xSize / number_of_tasks as width and pXMin + i * sliceX as x0 (instead of pXMin).
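To make that explicit, this is what I believe each task ends up passing compared with the single device call. The snippet only restates the argument values with the names used above; the example bounds and sizes are made up.

// Only the first and fourth kernel arguments differ from the single device version.
double pXMin = -2.0, pXMax = 1.0;                         // example bounds of the real axis
int    xSize = 1024, taskCount = 32;                      // example sizes, as before
int    sliceWidth = xSize / taskCount;                    // width argument per task
double sliceX     = (pXMax - pXMin) / (double) taskCount; // real-axis extent of one slice

for (int i = 0; i < taskCount; i++)
{
   int    width = sliceWidth;         // instead of xSize in the single device version
   double x0    = pXMin + i * sliceX; // instead of pXMin in the single device version
   // height, maxIterations, pYMin, xStep and yStep are passed unchanged
}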
I have been working on this for several days now and have removed quite a few bugs, but I am no longer able to see what I am doing wrong. Help is greatly appreciated.
Edit 1
@Huseyin asked for an image. The first screenshot was computed by the OpenCL single device version.
The second screenshot is the multi device version, computed with exactly the same parameters.
Edit 2
There was a question about how I enqueue the buffers. As you can see in the code above, I have a List<CLTestTask> to which I add the tasks and in which the buffer is enqueued. CLTestTask is an inner class; you can find its code below.
final class CLTestTask implements CLTask
{
   CLBuffer clBufferC = null;
   Buffer   bufferSliceC;
   String   kernelName;
   int      index;
   int      sliceWidth;
   int      width;
   int      height;
   int      maxIterations;
   double   pXMin;
   double   pYMin;
   double   x_step;
   double   y_step;

   public CLTestTask
   (
      String kernelName,
      int    index,
      int    sliceWidth,
      int    width,
      int    height,
      int    maxIterations,
      double pXMin,
      double pYMin,
      double x_step,
      double y_step,
      Buffer bufferSliceC
   )
   {
      this.index         = index;
      this.sliceWidth    = sliceWidth;
      this.width         = width;
      this.height        = height;
      this.maxIterations = maxIterations;
      this.pXMin         = pXMin;
      this.pYMin         = pYMin;
      this.x_step        = x_step;
      this.y_step        = y_step;
      this.kernelName    = kernelName;
      this.bufferSliceC  = bufferSliceC;
   } /*** CLTestTask ***/

   public Buffer execute (final CLSimpleQueueContext qc)
   {
      final CLCommandQueue queue   = qc.getQueue ();
      final CLContext      context = qc.getCLContext ();
      final CLKernel       kernel  = qc.getKernel (kernelName);

      clBufferC = context.createBuffer (bufferSliceC);
      out.println (pXMin + " " + sliceWidth);
      kernel
         .putArg (sliceWidth)
         .putArg (height)
         .putArg (maxIterations)
         .putArg (pXMin) // + index * x_step)
         .putArg (pYMin)
         .putArg (x_step)
         .putArg (y_step)
         .putArg (clBufferC)
         .rewind ();
      queue
         .put2DRangeKernel (kernel, 0, 0, sliceWidth, height, 0, 0)
         .putReadBuffer (clBufferC, true);
      return clBufferC.getBuffer ();
   } /*** execute ***/
} /*** Inner Class: CLTestTask ***/