I am trying to print the elements of an array of std::string passed as an argument to a kernel function, using the cuPrintf function.

The kernel code:

__global__ void testKernel(string wordList[10000])
{
    //access thread id
    const unsigned int bid = blockIdx.x;
    const unsigned int tid = threadIdx.x;
    const unsigned int index = bid * blockDim.x + tid;


    cuPrintf("wordList[%d]: %s \n", index, wordList[index]);
}

Code from the main function that sets up the execution parameters and launches the kernel:

    //Allocate device memory for word list
    string* d_wordList;
    cudaMalloc((void**)&d_wordList, sizeof(string)*number_of_words);

    //Copy word list from host to device
    cudaMemcpy(d_wordList, wordList, sizeof(string)*number_of_words, cudaMemcpyHostToDevice);

    //Setup execution parameters
    int n_blocks = (number_of_words + 255)/256;
    int threads_per_block = 256;

    dim3 grid(n_blocks, 1, 1);
    dim3 threads(threads_per_block, 1, 1);

    cudaPrintfInit();
    testKernel<<<grid, threads>>>(d_wordList);
    cudaDeviceSynchronize();
    cudaPrintfDisplay(stdout,true);
    cudaPrintfEnd();

I am getting the error: "Error 44 error : calling a __host__ function("std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string") from a __global__ function("testKernel") is not allowed D:...\kernel.cu 44 1 CUDA_BF_large_word_list"

What have I missed? Thanks in advance.

Alex Iacob

2 Answers

1

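In general, you can't use classes or functions from the C++ standard library (including std::string from <string>) in CUDA device code. They are compiled for the host only, which is why the compiler reports the call to the std::string destructor in your kernel.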

Use an array of char instead to hold your string(s).

Here is an example of manipulating "strings" as C-style arrays of null-terminated char, and passing them to a kernel.
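
As a rough sketch of the host-side preparation (not taken from the linked example), the words can be packed into one contiguous buffer of fixed-width, null-terminated slots. That buffer holds only plain char data, so it can be copied to the device byte for byte. The names packWords and slotWidth are hypothetical, and the sketch assumes that truncating over-long words is acceptable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Pack words into one contiguous buffer of fixed-width slots.
// Every slot is slotWidth bytes and always ends with at least one '\0';
// words longer than slotWidth - 1 characters are truncated (an assumption).
std::vector<char> packWords(const std::vector<std::string>& words,
                            std::size_t slotWidth)
{
    assert(slotWidth > 0);
    // Zero-fill so the unused tail of each slot acts as the terminator.
    std::vector<char> flat(words.size() * slotWidth, '\0');
    for (std::size_t i = 0; i < words.size(); ++i) {
        const std::size_t n = std::min(words[i].size(), slotWidth - 1);
        std::memcpy(&flat[i * slotWidth], words[i].data(), n);
    }
    return flat;
}
```

The buffer's data() and size() could then be passed to cudaMalloc and cudaMemcpy in place of the std::string array.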

Robert Crovella
  • I am reading the words from a text file like this: //build array of strings containing words from text file string wordList[10000]; if(file.is_open()) { for(int i=0; i<number_of_words; i++) file>>wordList[i]; //cout<... – Alex Iacob Sep 22 '14 at 13:30
  • I provided a link to example code in my answer that shows how to manipulate C-style strings. I assume you can handle the file I/O; that is not CUDA-specific. – Robert Crovella Sep 22 '14 at 15:20
  • Yes, there's no problem handling the file I/O. Thanks! – Alex Iacob Sep 23 '14 at 06:50
0

I modified the code to use an array of char instead of strings.

The updated version of the kernel is:

__global__ void testKernel(char* d_wordList)
{
    //access thread id
    const unsigned int bid = blockIdx.x;
    const unsigned int tid = threadIdx.x;
    const unsigned int index = bid * blockDim.x + tid;

    //cuPrintf("Hello World from kernel! \n");

    //print the first 10 characters of this thread's word (each word occupies a 20-char slot)
    cuPrintf("!! %c%c%c%c%c%c%c%c%c%c \n", d_wordList[index * 20 + 0],
                                           d_wordList[index * 20 + 1],
                                           d_wordList[index * 20 + 2],
                                           d_wordList[index * 20 + 3],
                                           d_wordList[index * 20 + 4],
                                           d_wordList[index * 20 + 5],
                                           d_wordList[index * 20 + 6],
                                           d_wordList[index * 20 + 7],
                                           d_wordList[index * 20 + 8],
                                           d_wordList[index * 20 + 9]);
}

I am also wondering whether there is an easier way to print the words from the char array. (Basically, I need to print, and later work with, one word per thread.)
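
One possible simplification, sketched here as plain host-side C++ so it can be checked without a GPU: because every word is null-terminated inside its 20-byte slot, the address of slot index is already a C string, so a single %s conversion can stand in for the ten %c conversions. The names wordAt and formatWord are hypothetical, and whether cuPrintf accepts a device pointer for %s is an assumption that would need to be verified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Pointer to the start of word `index` in a flat buffer of fixed-width slots.
const char* wordAt(const char* flat, std::size_t index, std::size_t slotWidth)
{
    assert(flat != nullptr);
    return flat + index * slotWidth;
}

// Format one word the way the kernel prints it, using %s instead of ten %c.
// Relies on each slot being null-terminated.
std::string formatWord(const char* flat, std::size_t index, std::size_t slotWidth)
{
    char line[64];
    std::snprintf(line, sizeof line, "!! %s \n", wordAt(flat, index, slotWidth));
    return std::string(line);
}
```

In the kernel, the equivalent expression would be d_wordList + index * 20, which is the same addressing the posted code already uses character by character.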

The code from the main function is:

    const int text_length = 20;

    char (*wordList)[text_length] = new char[10000][text_length];
    char *dev_wordList;

    for(int i=0; i<number_of_words; i++)
    {
        file>>wordList[i];
        cout<<wordList[i]<<endl;
    }

    cudaMalloc((void**)&dev_wordList, 20*number_of_words*sizeof(char));
    cudaMemcpy(dev_wordList, &(wordList[0][0]), 20 * number_of_words * sizeof(char), cudaMemcpyHostToDevice);

    char (*resultWordList)[text_length] = new char[10000][text_length];

    cudaMemcpy(resultWordList, dev_wordList, 20 * number_of_words * sizeof(char), cudaMemcpyDeviceToHost);

    for(int i=0; i<number_of_words; i++)
        cout<<resultWordList[i]<<endl;

    //Setup execution parameters
    int n_blocks = (number_of_words + 255)/256;
    int threads_per_block = 256;

    dim3 grid(n_blocks, 1, 1);
    dim3 threads(threads_per_block, 1, 1);

    cudaPrintfInit();
    testKernel<<<grid, threads>>>(dev_wordList);
    cudaDeviceSynchronize();
    cudaPrintfDisplay(stdout,true);
    cudaPrintfEnd();

If I use smaller values for the number of blocks and threads, like this:

    dim3 grid(20, 1, 1);
    dim3 threads(100, 1, 1);

the kernel launches correctly and displays one word per thread. But I need this to work for 10000 words. What have I missed?
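
One thing that may be worth checking with the rounded-up grid is how many threads fall past the end of the word list. The sketch below only reproduces the launch arithmetic on the host; the helper names blocksFor and surplusThreads are hypothetical.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= items
// (ceiling division, same formula as (number_of_words + 255) / 256).
unsigned blocksFor(unsigned items, unsigned threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (items + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads launched beyond the last valid item. Each of them computes an
// index >= items and must not touch the buffer.
unsigned surplusThreads(unsigned items, unsigned threadsPerBlock)
{
    return blocksFor(items, threadsPerBlock) * threadsPerBlock - items;
}
```

For 10000 words and 256 threads per block this gives 40 blocks, i.e. 10240 threads, so 240 threads would read past the 20 * number_of_words bytes that were allocated. A guard such as if (index >= number_of_words) return; at the top of the kernel, with number_of_words passed as an additional kernel argument (an assumption; the posted kernel takes only the buffer), would probably be needed regardless of any cuPrintf limits.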

Alex Iacob
  • Posting an answer to your own question and using it to ask a new question is probably not a good idea. It's not really how SO works. If you have a new question, it's recommended that you ask a new question. Note that to me, your final question here is unclear. What is not working specifically? Are you aware of things like the threads per block limitation? Are you aware of the fact that printf from a kernel is limited in terms of the amount of output it can generate? What actually is not working? (post a new question) – Robert Crovella Sep 23 '14 at 21:04
  • OK, thanks for the advice. I know about the threads-per-block limitation; in my case the number of threads per block is 512. The problem was that with larger grid/block dimensions the kernel produces no output, but the cause may be a limitation of the cuPrintf function. – Alex Iacob Sep 24 '14 at 07:06
  • I investigated the problem and the cause was that cuPrintf is limited to grids of up to 2048 threads. – Alex Iacob Sep 24 '14 at 07:53