I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel. I have the following working code for a 1-dimensional array:

using System;
using ManagedCuda;

class Program
{
    static void Main(string[] args)
    {
        int N = 10;
        int deviceID = 0;
        CudaContext ctx = new CudaContext(deviceID);
        CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
        kernel.GridDimensions = (N + 255) / 256;
        kernel.BlockDimensions = Math.Min(N,256);

        // Allocate input vectors h_A in host memory
        float[] h_A = new float[N];

        // Initialize input vectors h_A
        for (int i = 0; i < N; i++)
        {
            h_A[i] = i;
        }

        // Allocate vectors in device memory and copy vectors from host memory to device memory 
        CudaDeviceVariable<float> d_A = h_A;
        CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);

        // Invoke kernel
        kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);

        // Copy result from device memory to host memory
        float[] h_C = d_C;
        // h_C contains the result in host memory
    }
}

with the following kernel code:

extern "C"
__global__ void DoubleIt(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * 2;
}

As I said, everything works fine, but I want to work with a 2D array, as follows:

// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];

// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
    h_A[i] = new float[W];
    for (int j = 0; j < W; j++)
    {
        h_A[i][j] = i*W+j;
    }
}

I need the entire 2nd dimension to be handled by the same thread, so kernel.BlockDimensions must stay one-dimensional and each kernel thread needs to get a 1D array with 10 elements.

So my bottom-line question is: how should I copy this 2D array to the device, and how do I use it in the kernel? (Per the example, there should be a total of 10 threads.)

  • I have been working with Cudafy for years now and have the same problem. To my knowledge (and I could be wrong), there is currently no managed-to-CUDA-C transpiler that supports jagged arrays; they don't deal with the pointers correctly. ManagedCuda, which I'm not familiar with, might handle it differently. With Cudafy, you CAN write your own CUDA C and load it. To understand the allocation issue, try this: https://stackoverflow.com/questions/1047369/allocate-2d-array-on-device-memory-in-cuda – bopapa_1979 May 24 '17 at 16:34
  • I saw this conversation before but didn't really understand it... it says something about `cudaMemcpy2D()`, but that isn't implemented in the code. Anyway, my code is written in C# and I want to find a solution in C#, because my whole program is already written in it. I know there are solutions in C/C++, but they don't really help me and I couldn't translate them to managedCuda. Any clue would be helpful. – TVC May 24 '17 at 20:05

1 Answer


Short answer: you shouldn't do it...

Long answer: Jagged arrays are difficult to handle in general. Instead of one contiguous segment of memory for your data, you have many small ones scattered around in memory. What happens when you copy the data to the GPU? If you had one large contiguous segment, you would call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you'd have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel you maintain a host-side array CudaDeviceVariable<float>[] that manages your CUdeviceptrs. Copying data is already quite slow in general; doing it this way is probably a real performance killer...

To conclude: if you can, use flattened arrays and index the entries as y * DimX + x. Even better on the GPU side, use pitched memory, where the allocation is done so that each line starts at a "good" address: the index then becomes y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line gets some additional bytes appended.
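
For illustration, here is a minimal sketch of that flattened approach, reusing N, W and h_A from the question (the kernel name DoubleRows and its loading are assumptions, not part of the original code):

// Flatten the jagged h_A into one contiguous buffer, indexed as i * W + j:
float[] h_A_flat = new float[N * W];
for (int i = 0; i < N; i++)
    for (int j = 0; j < W; j++)
        h_A_flat[i * W + j] = h_A[i][j];

// One allocation and one copy instead of N separate transfers:
CudaDeviceVariable<float> d_A = h_A_flat;
CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N * W);

// Assumes kernel was loaded from a ptx containing DoubleRows:
kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N, W);

//Corresponding kernel in the *.cu file, one thread per row as the question requires:
//extern "C" __global__ void DoubleRows(const float* A, float* C, int N, int W)
//{
//    int i = blockDim.x * blockIdx.x + threadIdx.x;
//    if (i < N)
//        for (int j = 0; j < W; j++)
//            C[i * W + j] = A[i * W + j] * 2;
//}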

For completeness: C# also has true 2-dimensional arrays like float[,]. You could use these on the host side instead of flattened 1D arrays, but I wouldn't recommend doing so: the ISO standard for .net does not guarantee that the internal memory is actually contiguous, an assumption managedCuda has to rely on in order to use these arrays. The current .net framework doesn't have any internal weirdness here, but who knows if it will stay like this...

This is how the jagged-array copy could be realized:

float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h; //represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d; //Array of CudaDeviceVariables to manage memory, source for pointers in ptrsToData_h.

int sizeX = 512;
int sizeY = 256;

data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
    data_h[x] = new float[sizeY];
    arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
    ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
    //ToDo: init data on host...
}
//Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);

//Copy data:
for (int x = 0; x < sizeX; x++)
{
    arrayOfarray_d[x].CopyToDevice(data_h[x]);
}

//Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);

//Kernel in the *.cu file:
//extern "C" __global__ void kernel(float** data_d, ...)
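
As discussed in the comments below, device memory in managedCuda is not garbage-collected, so the allocations above should be released explicitly once no longer needed (or, better, allocated once and reused as often as possible); a sketch:

//Free the device allocations manually when done:
for (int x = 0; x < sizeX; x++)
{
    arrayOfarray_d[x].Dispose();
}
data_d.Dispose();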

This is a sample for CudaPitchedDeviceVariable:

int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
{
    for (int x = 0; x < dimX; x++)
    {
        array_host[y * dimX + x] = x * y;
    }
}

arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);

//Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY)
        return;

    //pointer arithmetic: add y*pitch to char* pointer as pitch is given in bytes,
    //which gives the start of line y. Convert to float* and add x, to get the
    //value at entry x of line y:
    float value = *(((float*)((char*)data + y * pitch)) + x);

    *(((float*)((char*)data + y * pitch)) + x) = value + 1;

    //Or simpler if you don't like pointers:
    float* line = (float*)((char*)data + y * pitch);
    float value2 = line[x];
}
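
The kernel above indexes a 2D grid, so the launch configuration on the C# side must be two-dimensional as well; a sketch of that setup, assuming 16x16 blocks (the block size is an arbitrary choice):

// 2D block and grid so that every (x, y) entry gets its own thread:
kernel.BlockDimensions = new dim3(16, 16, 1);
kernel.GridDimensions = new dim3((dimX + 15) / 16, (dimY + 15) / 16, 1);
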
  • Very helpful, thanks. My C# program stores the floats as a list of lists, `List<List<float>>`, so I need to convert it to an array anyway. Per your advice, a jagged 2D array would kill performance, so I'll take the second option, pitched memory. Can you please also show an example of using it in my case? – TVC Jun 01 '17 at 20:05
  • I added it to the answer. – kunzmi Jun 01 '17 at 21:46
  • Thanks, working well. I flattened the arrays as you recommended. After a few runs over about 6M arrays (of arrays, [6M][31]) I'm getting a memory exception: **ErrorOutOfMemory**. Shall I manually release the allocated pitched memory? I'm comparing those 6M arrays against a single array each time; can I reuse that big array without copying it over and over again? – TVC Jun 06 '17 at 15:03
  • Memory allocations on the GPU are not managed in any way; you're responsible for manually freeing allocations that are no longer used, by calling the Dispose() method of the CudaDeviceVariable. But instead of allocating and freeing memory non-stop (expensive and slow), allocate once and reuse it as often as you need. – kunzmi Jun 06 '17 at 22:10
  • Perfect, I just edited your example to release the memory. I guess that's only for the GPU-allocated memory and not for the host side. Please confirm my edits. – TVC Jun 07 '17 at 06:30
  • There is a little bug in your kernel example: the y dimension is useless (always 0) because it's not a 2D block, it's a 1D block, which means you have to multiply the pitch by x. The **+x** at the end is supposed to become a for loop from 0 to dimX, and that loop variable should be added instead (let's say **+n** instead of **+x**). – TVC Jun 07 '17 at 21:01
  • The example is meant to give you an idea of how to use pitched memory for some 2D coordinates (x, y) in the general use case. If in your very specific case the default 2D grid/block layout doesn't fit, it is still straightforward to adapt the example to however you get your two indices x and y (or (x, n), or whatever you'd call them)... I'll keep the example as is, in the most basic usage scenario, to keep the answer more useful for other readers. – kunzmi Jun 08 '17 at 09:27