
I am having a spot of bother with this basic CUDA code.

I have a char** which is a flat 2d array of passwords. My current CUDA implementation simply iterates through this list and displays the passwords. However, when I go to display them I simply get "(NULL)". I'm not quite sure why. Can someone explain what is happening?

Main:

char ** pwdAry;
pwdAry = new char *[numberOfPwd];

//pwdAry given some values (flat 2d array layout)
const int pwdArySize = sizeof(pwdAry);    
dim3 grid(gridSize,gridSize);
dim3 block(blockSize,blockSize);

searchKeywordKernel << <grid, block >> >(pwdAry);

return EXIT_SUCCESS;

Cuda:

__global__ void searchKeywordKernel(char **passwordList)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int pitch = blockDim.x * gridDim.x;
    int idx = x + y * pitch;
    int tidy = idx / pitch;
    int tidx = idx - (pitch * tidy);
    int bidx = tidx / blockDim.x;
    int bidy = tidy / blockDim.y;
    int currentThread = threadIdx.x + blockDim.x * threadIdx.y;

    printf("hi, i am thread: %i, and my block x: %i, and y: %i\n", currentThread, bidx, bidy);
    printf("My password is: %s\n", passwordList[currentThread]);
}
talonmies
shorty
  • `const int pwdArySize = sizeof(pwdAry);` That isn't going to give you the number of elements in an array, if that was your goal here. – PaulMcKenzie Jan 10 '16 at 01:51
  • An array of pointers is not a flat array. I would start by understanding what a flattened array means in CUDA speak, and then convert your code to use a flat array. And if you want help with a code that is not working you are supposed to provide an MCVE. – Robert Crovella Jan 10 '16 at 02:57
  • Is that not the same as what I am doing here to create essentially a 2d array? http://stackoverflow.com/questions/5397976/multidimensional-array-as-a-flat-array-problem – shorty Jan 10 '16 at 09:37
  • It is illegal to pass `pwdAry` to a CUDA kernel because it contains pointers to host memory. Your kernel never runs. If you had error checking in your code, or used cuda-memcheck, you would already know this. This type of question comes up almost daily here, and there are a lot of working examples you can study on different approaches to solving this sort of problem, if you care to search for them. – talonmies Jan 10 '16 at 09:58
  • Ah, I see, thank you, I didn't know that. I have tried copying it over to host memory first but that doesn't work either. I have searched all over Stack Overflow and Google and no one seems to try to pass a char** to a kernel – shorty Jan 10 '16 at 10:11
  • http://stackoverflow.com/a/6137517/681865 contains a complete working example. – talonmies Jan 10 '16 at 11:25
  • There is also [this](http://stackoverflow.com/questions/19459788/2d-char-array-to-cuda-kernel/19463145#19463145) and [this](http://stackoverflow.com/questions/19759343/cuda-is-it-better-to-use-m-for-2d-static-arrays-or-flatten-them-to-m/19759898#19759898), and [this](http://stackoverflow.com/questions/23130728/how-to-pass-string-matrix-to-cuda-kernel-from-c/23137136#23137136) – Robert Crovella Jan 10 '16 at 16:45
  • @RobertCrovella sorry but none of the links you provided actually answers my question of passing a char** to a kernel – shorty Jan 10 '16 at 21:08
  • Sure, I didn't propose any of them as an "answer". What's listed here is "comments". Those links are all intended to point out various ways to handle array-of-char data (effectively multiple strings) without using a double-pointer method. If you want to use a double-pointer method, the canonical example is the one that @talonmies has already indicated. For beginners, it's not a recommended approach. It will also lead to more complex code, that is harder to maintain and more error-prone. And it may result in code that runs more slowly than data referenced via a single pointer. – Robert Crovella Jan 10 '16 at 21:46
  • @RobertCrovella but surely there is no other way to create a dynamic 2d char array other than using char**? – shorty Jan 10 '16 at 23:11
  • Perhaps not. My point is that the problem you are trying to solve (handle an array of passwords) could be solved in various ways. Certainly one method is via a char** array (what I call a "double-pointer" array). Other methods (as I indicated) could use a char* array, which is considerably easier to handle in CUDA. If you wish to use a char** array, it's possible; and the canonical approach is covered in the link provided by @talonmies. – Robert Crovella Jan 10 '16 at 23:23
  • @RobertCrovella I can use a char* array? Can I reference and add strings like I would if it were a char**, i.e. can I do something like pwdAry[i] if it were a char*? If so that would solve all of my problems; could you show me? – shorty Jan 10 '16 at 23:29
  • @RobertCrovella even an example not directly related to mine would be much appreciated as im very stuck – shorty Jan 10 '16 at 23:42
  • @talonmies... maybe its a device function using dynamic parallelism. :p – sgarizvi Jan 11 '16 at 09:02
  • @sgarizvi: It is labeled `Main` so I *very* much doubt that – talonmies Jan 11 '16 at 09:05
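As talonmies points out, a kernel fault like this is invisible without error checking. As a rough sketch (not from the original post; the macro name and placement are illustrative), wrapping runtime calls and checking after the launch would surface the illegal host-pointer dereference:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error-checking helper (illustrative name)
#define CUDA_CHECK(call) do {                                      \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess)                                        \
        printf("CUDA error: %s\n", cudaGetErrorString(err));       \
} while (0)

// Usage around the launch in the question:
//   searchKeywordKernel<<<grid, block>>>(pwdAry);
//   CUDA_CHECK(cudaGetLastError());        // launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // runtime faults, e.g. illegal address
```

With this in place (or under cuda-memcheck), the kernel that dereferences host pointers reports an error instead of silently printing nothing useful.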

1 Answer


Based on discussion in the comments, here is an example code that roughly follows the code in the question, using 3 different methods:

  1. Use a "flattened" array. This is the traditional advice for beginners who are asking how to handle a double-pointer array (char **, or any other type), or any data structure that contains embedded pointers. The basic idea is to create a single-pointer array of the same type (e.g. char *), and copy all the data to that array, end-to-end. In this case, since the array elements are of variable length, we also need to pass an array containing the starting index of each string.

  2. Use a direct double-pointer method. I consider this code difficult to write. It may also have performance implications. The canonical example is here, a stepwise description of what is required algorithmically is here, and here is a 3D (i.e. triple-pointer) worked example with method description (yuck!). This is fundamentally a deep copy in CUDA, and I consider it somewhat more difficult than typical CUDA coding.

  3. Use the managed memory subsystem, which is available on CUDA platforms that support it. Coding-wise, this is probably simpler than either of the above two approaches.

Here is a worked example of all 3 methods:

$ cat t1035.cu
#include <stdio.h>
#include <string.h>

#define nTPB 256

__global__ void kern_1D(char *data, unsigned *indices, unsigned num_strings){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < num_strings)
    printf("Hello from thread %d, my string is %s\n", idx, data+indices[idx]);
}

__global__ void kern_2D(char **data, unsigned num_strings){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < num_strings)
    printf("Hello from thread %d, my string is %s\n", idx, data[idx]);
}

int main(){

  const int num_strings = 3;
  const char s0[] = "s1\0";
  const char s1[] = "s2\0";
  const char s2[] = "s3\0";
  int ds[num_strings];
  ds[0] = sizeof(s0)/sizeof(char);
  ds[1] = sizeof(s1)/sizeof(char);
  ds[2] = sizeof(s2)/sizeof(char);
  // pretend we have a dynamically allocated char** array
  char **data;
  data = (char **)malloc(num_strings*sizeof(char *));
  data[0] = (char *)malloc(ds[0]*sizeof(char));
  data[1] = (char *)malloc(ds[1]*sizeof(char));
  data[2] = (char *)malloc(ds[2]*sizeof(char));
  // initialize said array
  strcpy(data[0], s0);
  strcpy(data[1], s1);
  strcpy(data[2], s2);
  // method 1: "flattening"
  char *fdata = (char *)malloc((ds[0]+ds[1]+ds[2])*sizeof(char));
  unsigned *ind   = (unsigned *)malloc(num_strings*sizeof(unsigned));
  unsigned next = 0;
  for (int i = 0; i < num_strings; i++){
    strcpy(fdata+next, data[i]);
    ind[i] = next;
    next += ds[i];}
  //copy to device
  char *d_fdata;
  unsigned *d_ind;
  cudaMalloc(&d_fdata, next*sizeof(char));
  cudaMalloc(&d_ind, num_strings*sizeof(unsigned));
  cudaMemcpy(d_fdata, fdata, next*sizeof(char), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ind, ind, num_strings*sizeof(unsigned), cudaMemcpyHostToDevice);
  printf("method 1:\n");
  kern_1D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_fdata, d_ind, num_strings);
  cudaDeviceSynchronize();
  //method 2: "2D" (pointer-to-pointer) array
  char **d_data;
  cudaMalloc(&d_data, num_strings*sizeof(char *));
  char **d_temp_data;
  d_temp_data = (char **)malloc(num_strings*sizeof(char *));
  for (int i = 0; i < num_strings; i++){
    cudaMalloc(&(d_temp_data[i]), ds[i]*sizeof(char));
    cudaMemcpy(d_temp_data[i], data[i], ds[i]*sizeof(char), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data+i, &(d_temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);}
  printf("method 2:\n");
  kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_data, num_strings);
  cudaDeviceSynchronize();
  // method 3: managed allocations
  // start over with a managed char** array
  char **m_data;
  cudaMallocManaged(&m_data, num_strings*sizeof(char *));
  cudaMallocManaged(&(m_data[0]), ds[0]*sizeof(char));
  cudaMallocManaged(&(m_data[1]), ds[1]*sizeof(char));
  cudaMallocManaged(&(m_data[2]), ds[2]*sizeof(char));
  // initialize said array
  strcpy(m_data[0], s0);
  strcpy(m_data[1], s1);
  strcpy(m_data[2], s2);
  // call kernel directly on managed data
  printf("method 3:\n");
  kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(m_data, num_strings);
  cudaDeviceSynchronize();

  return 0;
}


$ nvcc -arch=sm_35 -o t1035 t1035.cu
$ cuda-memcheck ./t1035
========= CUDA-MEMCHECK
method 1:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
method 2:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
method 3:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
========= ERROR SUMMARY: 0 errors
$

Notes:

  1. I suggest running this code with cuda-memcheck if you are just testing it out for the first time. I have omitted proper cuda error checking for brevity of presentation, but I recommend it any time you are having trouble with a CUDA code. Proper execution of this code depends on having a managed memory subsystem available (read the doc links I have provided). If your platform does not support it, running this code as-is will probably result in a seg fault, because I have not included proper error checking.

  2. Copying a double-pointer array from device to host, although not explicitly covered in this example, is essentially the reverse of the steps for each of the 3 methods. For method 1, a single cudaMemcpy call can do it. For method 2, it requires a for-loop that reverses the steps to copy to the device (including the use of the temp pointers). For method 3, nothing at all is required, other than proper adherence to managed memory coding practices, such as use of cudaDeviceSynchronize() after a kernel call, before attempting to access the device from host code again.

  3. I don't wish to argue about whether or not methods 1 and 3 explicitly adhere to the letter of the question in terms of providing a method to pass a char ** array to a CUDA kernel. If your focus is that narrow, then please use method 2, or else disregard this answer entirely.
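The device-to-host direction mentioned in note 2 can be sketched as follows. This reuses the variable names from the example above and is not part of the original answer, just an illustration of reversing the method-2 steps using the host-side copy of the device pointers:

```cuda
// Method 2 reverse copy (sketch): d_temp_data still holds the device
// pointers on the host side, so each string can be copied back directly.
for (int i = 0; i < num_strings; i++)
  cudaMemcpy(data[i], d_temp_data[i], ds[i]*sizeof(char), cudaMemcpyDeviceToHost);

// Method 1 reverse copy: a single call suffices.
// cudaMemcpy(fdata, d_fdata, next*sizeof(char), cudaMemcpyDeviceToHost);

// Method 3: no explicit copy; just cudaDeviceSynchronize() after the
// kernel before touching m_data on the host.
```

If the host-side pointer array (d_temp_data here) were not retained, the device pointers would first have to be copied back from d_data before the per-string copies could be issued.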

EDIT: Based on a question in the comments below, here is the above code modified with a different initialization sequence for the host-side strings (at line 42). There are now compilation warnings, but they arise from the code OP specifically requested be used:

$ cat t1036.cu
#include <stdio.h>
#include <string.h>

#define nTPB 256

__global__ void kern_1D(char *data, unsigned *indices, unsigned num_strings){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < num_strings)
    printf("Hello from thread %d, my string is %s\n", idx, data+indices[idx]);
}

__global__ void kern_2D(char **data, unsigned num_strings){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < num_strings)
    printf("Hello from thread %d, my string is %s\n", idx, data[idx]);
}

int main(){

  const int num_strings = 3;
#if 0
  const char s0[] = "s1\0";
  const char s1[] = "s2\0";
  const char s2[] = "s3\0";
  int ds[num_strings];
  ds[0] = sizeof(s0)/sizeof(char);
  ds[1] = sizeof(s1)/sizeof(char);
  ds[2] = sizeof(s2)/sizeof(char);
  // pretend we have a dynamically allocated char** array
  char **data;
  data = (char **)malloc(num_strings*sizeof(char *));
  data[0] = (char *)malloc(ds[0]*sizeof(char));
  data[1] = (char *)malloc(ds[1]*sizeof(char));
  data[2] = (char *)malloc(ds[2]*sizeof(char));
  // initialize said array
  strcpy(data[0], s0);
  strcpy(data[1], s1);
  strcpy(data[2], s2);
#endif
  char ** pwdAry; pwdAry = new char *[num_strings]; for (int a = 0; a < num_strings; a++) { pwdAry[a] = new char[1024]; } for (int a = 0; a < 3; a++) { pwdAry[a] = "hello\0"; }
  // method 1: "flattening"
  char *fdata = (char *)malloc((1024*num_strings)*sizeof(char));
  unsigned *ind   = (unsigned *)malloc(num_strings*sizeof(unsigned));
  unsigned next = 0;
  for (int i = 0; i < num_strings; i++){
    memcpy(fdata+next, pwdAry[i], 1024);
    ind[i] = next;
    next += 1024;}
  //copy to device
  char *d_fdata;
  unsigned *d_ind;
  cudaMalloc(&d_fdata, next*sizeof(char));
  cudaMalloc(&d_ind, num_strings*sizeof(unsigned));
  cudaMemcpy(d_fdata, fdata, next*sizeof(char), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ind, ind, num_strings*sizeof(unsigned), cudaMemcpyHostToDevice);
  printf("method 1:\n");
  kern_1D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_fdata, d_ind, num_strings);
  cudaDeviceSynchronize();
  //method 2: "2D" (pointer-to-pointer) array
  char **d_data;
  cudaMalloc(&d_data, num_strings*sizeof(char *));
  char **d_temp_data;
  d_temp_data = (char **)malloc(num_strings*sizeof(char *));
  for (int i = 0; i < num_strings; i++){
    cudaMalloc(&(d_temp_data[i]), 1024*sizeof(char));
    cudaMemcpy(d_temp_data[i], pwdAry[i], 1024*sizeof(char), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data+i, &(d_temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);}
  printf("method 2:\n");
  kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_data, num_strings);
  cudaDeviceSynchronize();
  // method 3: managed allocations
  // start over with a managed char** array
  char **m_data;
  cudaMallocManaged(&m_data, num_strings*sizeof(char *));
  cudaMallocManaged(&(m_data[0]), 1024*sizeof(char));
  cudaMallocManaged(&(m_data[1]), 1024*sizeof(char));
  cudaMallocManaged(&(m_data[2]), 1024*sizeof(char));
  // initialize said array
  for (int i = 0; i < num_strings; i++)
    memcpy(m_data[i], pwdAry[i], 1024);
  // call kernel directly on managed data
  printf("method 3:\n");
  kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(m_data, num_strings);
  cudaDeviceSynchronize();

  return 0;
}


$ nvcc -arch=sm_35 -o t1036 t1036.cu
t1036.cu(42): warning: conversion from a string literal to "char *" is deprecated

t1036.cu(42): warning: conversion from a string literal to "char *" is deprecated

$ cuda-memcheck ./t1036
========= CUDA-MEMCHECK
method 1:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
method 2:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
method 3:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
========= ERROR SUMMARY: 0 errors
$
Robert Crovella
  • Fantastic example, thank you! But if I have something like this: `char ** pwdAry; pwdAry = new char *[3]; for (int a = 0; a < 3; a++) { pwdAry[a] = new char[1024]; } for (int a = 0; a < 3; a++) { pwdAry[a] = "hello\0"; }` How would I then pass that into your method 2 implementation? – shorty Jan 11 '16 at 10:53
  • I think this is pretty much just a C/C++ programming question now. However, I've added additional code to my answer which uses the exact initialization code you provided in your comment. Note that there is now a compile-time warning. That arises out of the code you have indicated, so I'm not going to bother fixing it. – Robert Crovella Jan 11 '16 at 14:37