
I am using C# and CUDAfy.NET (yes, this problem is easier in straight C with pointers, but I have my reasons for using this approach given the larger system).

I have a video frame grabber card that collects 1024 x 1024 byte images at 30 FPS. Every 33.3 ms it fills a slot in a circular buffer and returns a System.IntPtr that points to that unmanaged 1D array of bytes (byte*). The circular buffer has 15 slots.

On the GPU device (Tesla K40) I want a global, dense 2D array that mirrors the circular queue, with one row per slot:

byte[15, 1024*1024] rawdata; 
// if CUDAfy.NET supported jagged arrays I could use byte[15][1024*1024], but it does not

How can I fill in a different row every 33 ms? Do I use something like:

gpu.CopyToDevice<byte>(inputPtr, 0, rawdata, offset, length); // length = 1024*1024
// offset is computed as rowID*(1024*1024), where rowID wraps to 0 via modulo 15.
// inputPtr is the System.IntPtr that points to the (unmanaged) buffer in the circular queue.
// rawdata is a device buffer allocated with gpu.Allocate<byte>(1024*1024);
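
To make the offset arithmetic concrete, here is a minimal host-only sketch. It does not call CUDAfy at all: a managed byte[] stands in for the device buffer, Marshal.AllocHGlobal stands in for the grabber's unmanaged frame, and the names RingOffsetSketch and RowOffset, as well as the tiny frame size, are assumptions made for illustration. It only shows that copying one frame to rowID * FrameSize in a flat buffer fills row rowID of a dense [15, FrameSize] array.

```csharp
using System;
using System.Runtime.InteropServices;

static class RingOffsetSketch
{
    const int Slots = 15;     // rows in the ring, as in the question
    const int FrameSize = 8;  // tiny stand-in for 1024 * 1024 so the sketch runs instantly

    // Element offset of the row that frame number frameIndex lands in.
    public static int RowOffset(long frameIndex)
    {
        int rowId = (int)(frameIndex % Slots);  // wraps back to row 0 after row 14
        return rowId * FrameSize;
    }

    public static void Main()
    {
        byte[] flat = new byte[Slots * FrameSize];     // host stand-in for the device buffer
        IntPtr src = Marshal.AllocHGlobal(FrameSize);  // stand-in for the grabber's frame
        try
        {
            for (long frame = 0; frame < 17; frame++)
            {
                // Fill the fake frame with its own frame number.
                for (int i = 0; i < FrameSize; i++)
                    Marshal.WriteByte(src, i, (byte)frame);

                // Copy exactly one frame (FrameSize bytes) into its row.
                Marshal.Copy(src, flat, RowOffset(frame), FrameSize);
            }
        }
        finally
        {
            Marshal.FreeHGlobal(src);
        }

        // View the same bytes as a dense [Slots, FrameSize] array (row-major layout).
        byte[,] dense = new byte[Slots, FrameSize];
        Buffer.BlockCopy(flat, 0, dense, 0, flat.Length);

        Console.WriteLine(dense[0, 0]);  // row 0 was overwritten by frame 15
        Console.WriteLine(dense[1, 3]);  // row 1 was overwritten by frame 16
        Console.WriteLine(dense[2, 0]);  // row 2 still holds frame 2
    }
}
```

Note that the count passed to the copy is one frame, not the size of the whole ring; the flat buffer itself has to be Slots * FrameSize elements long for the last row's offset to be in range.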

And in my kernel header is:

[Cudafy]
public static void filter(GThread thread, byte[,] rawdata, int frameSize, byte[] result)

I did try something along these lines, but CUDAfy has no overload matching:

GPGPU.CopyToDevice(T) Method (IntPtr, Int32, T[,], Int32, Int32, Int32)

So I used gpu.Cast to view the 2D device array as a 1D array.

I tried the code below, but I get a CUDA.NET exception: ErrorLaunchFailed

FYI: When I try the CUDA emulator, it aborts on the CopyToDevice, claiming that "Data is not host allocated".

public static byte[] process(System.IntPtr data, int slot)
{
    Stopwatch watch = new Stopwatch();
    watch.Start();
    byte[] output = new byte[FrameSize];
    int offset = slot*FrameSize;
    gpu.Lock();
    byte[] rawdata = gpu.Cast<byte>(grawdata, FrameSize); // What is the size supposed to be? Documentation lacking
    gpu.CopyToDevice<byte>(data, 0, rawdata, offset, FrameSize * frameCount);
    byte[] goutput = gpu.Allocate<byte>(output);
    gpu.Launch(height, width).filter(rawdata, FrameSize, goutput);
    runTime = watch.Elapsed.ToString();
    gpu.CopyFromDevice(goutput, output);
    gpu.Free(goutput);
    gpu.Synchronize();
    gpu.Unlock();
    watch.Stop();
    totalRunTime = watch.Elapsed.ToString();
    return output;
}
krlzlx
Dr.YSG
  • It's difficult to understand what your question is, could you please reword or add details? If your concern is about your pointer arithmetic, it seems correct to me. – user703016 Dec 25 '14 at 17:56
  • I updated the story as it stands today. – Dr.YSG Dec 25 '14 at 19:03
  • What is the "CUDA emulator" mentioned in the question? – njuffa Dec 25 '14 at 19:37
  • CUDAfy.NET has a CPU-based emulation mode that you can use to debug the kernel without going to the NSight tools. It is useful for this sort of high-level debugging. – Dr.YSG Dec 25 '14 at 19:52
  • Data is not host allocated typically refers to not having host-pinned memory when it is needed. You can check out the second to last post on https://cudafy.codeplex.com/discussions/352698 to see how they did it for an async copy. I'm not sure exactly where or why you would need host-pinned code, but that does seem to be the problem with the emulator. Are you getting the launch error from the kernel invocation? If you comment that line out are you error free? – Christian Sarofeen Jan 01 '15 at 17:34
  • I added the detail about the Emulator, since I was trying to figure out what was wrong with the Launch. So let me put a FYI, around the Emulator comment, since it is not germane to the issue. – Dr.YSG Jan 01 '15 at 18:31
  • When I comment the launch, then yes, there are no errors. But I still suspect the copy, since I am running the identical kernel with a different launch that copies ALL the data (not just one 1024x1024 image frame) and that works fine. Would it help to provide both launches and the kernel? – Dr.YSG Jan 01 '15 at 18:34
  • I think I made a dumb mistake, and I am passing the casted gdata[] array, instead of the gdata[,] array on launch. I only needed the gdata[] array for the CopyToDevice, so that I could offset. Stupid me. – Dr.YSG Jan 01 '15 at 18:46
  • As I said, it was a dumb mistake on my part. I fixed it a week ago. – Dr.YSG Jan 08 '15 at 20:35

3 Answers


For now, I propose one of two workarounds: 1. Run the program only in native mode (not in emulation mode), or 2. Do not handle the pinned-memory allocation yourself.

There seems to be an open issue with that now. But this happens only in emulation mode.

see: https://cudafy.codeplex.com/workitem/636

  • I don't use emulation. I just wanted to test it for debug purposes. But as I said above, I have solved my issue. – Dr.YSG Mar 09 '15 at 09:07

If I understand your question properly I think you are looking to convert the
byte* you get from the cyclic buffer into a multi-dimensional byte array to be sent to
the graphics card API.

            int slots = 15;
            int rows = 1024;
            int columns = 1024;

//Try this
            for (int currentSlot = 0; currentSlot < slots; currentSlot++)
            {
                IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);
                byte[,] rawForGpu = new byte[rows, columns];

                int offset = 0;
                for (int m = 0; m < rows; m++)
                    for (int n = 0; n < columns; n++)
                    {
                        // Marshal.ReadByte (System.Runtime.InteropServices) reads
                        // one byte at the given offset from the unmanaged pointer
                        rawForGpu[m, n] = Marshal.ReadByte(intPtrToUnManagedMemory,
                                                           offset++);
                    }
                //then send rawForGpu to your GPU method
            }

//or try this
            for (int currentSlot = 0; currentSlot < slots; currentSlot++)
            {
                IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);

                // use Marshal.Copy ?
                byte[] byteData = CopyIntPtrToByteArray(intPtrToUnManagedMemory); 

                byte[,] rawForGpu = ConvertTo2DArray(byteData, rows, columns);
            }
        }

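        private static byte[,] ConvertTo2DArray(byte[] byteArr, int rows, int columns)
        {
            byte[,] data = new byte[rows, columns];
            int totalElements = rows * columns;
            //Convert 1D to 2D (rows, columns): a rectangular array is stored row-major,
            //so one block copy fills it (for byte, element count == byte count)
            Buffer.BlockCopy(byteArr, 0, data, 0, totalElements);
            return data;
        }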

        private static IntPtr CopyContextFrom(int slotNumber)
        {
            //code that returns the byte* from the circular buffer goes here.
            return IntPtr.Zero;
        }
Vignesh.N

You should consider using the built-in GPGPU async functionality, which is a really efficient way to move data between host and device, and launching the kernel with gpuKern.LaunchAsync(...).

Check out http://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU for an efficient way to use this. Another great example is PinnedAsyncIO.cs in the CudafyExamples project. It has everything you need to do what you're describing.

This is in CudaGPU.cs in Cudafy.Host project, which matches the method you're looking for (only it's async):

public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, DevicePtrEx devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[, ,] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[,] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
setigamer