
I want to compute some data on the GPU in parallel and display the results between kernel calls. The code would look something like this:

void execute(){
    runthread = true;
    float erg[N * 2] = {};                    // host result buffer (N is a compile-time constant)
    float *d_a, *d_b, *d_c, *d_erg;
    size_t sizeErg = sizeof(float) * 2 * N;
    size_t sizeAB = sizeof(float) * N;
    float c[2] = { 1, 2 };
    gpuErrchk(cudaMalloc((void**)&d_a, sizeAB));
    gpuErrchk(cudaMalloc((void**)&d_b, sizeAB));
    gpuErrchk(cudaMalloc((void**)&d_c, sizeof(float) * 2));
    gpuErrchk(cudaMalloc((void**)&d_erg, sizeErg));

    gpuErrchk(cudaMemcpy(d_a, anode, sizeAB, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_b, kathode, sizeAB, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_c, c, 2 * sizeof(float), cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;                  // events to time each kernel launch
    gpuErrchk(cudaEventCreate(&start));
    gpuErrchk(cudaEventCreate(&stop));

    float time = 0;
    int i = 0;
    while (runthread){
        gpuErrchk(cudaEventRecord(start));
        kernel<<<(N * 2) / 64, 64>>>(d_a, d_b, d_c, d_erg, N);
        gpuErrchk(cudaEventRecord(stop));
        gpuErrchk(cudaDeviceSynchronize());
        gpuErrchk(cudaEventElapsedTime(&time, start, stop));
        gpuErrchk(cudaMemcpy(erg, d_erg, sizeErg, cudaMemcpyDeviceToHost));

        float acc = 0;
        for (int j = 0; j < N * 2; j++){
            acc += erg[j];
        }
        std::cout << "Erg" << i << " = " << acc << std::endl;
        std::cout << "Kernel execution took " << time << " ms" << std::endl;
        i++;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_erg);
}

This function is part of a class that has the bool member variable runthread. My idea was to call another member function that starts a new CPU thread running execute(), wait in the main function until the user inputs something, and then call a third member function that sets runthread = false, so the thread would finish after the next kernel completes. However, I always get error messages from Visual Studio. Now I am wondering whether this is even possible, or is the CPU busy controlling the GPU execution? Does anyone have experience with multithreading and parallel execution on the GPU and CPU? Or should I just check for user input inside the while loop?
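Roughly the structure I had in mind (just a sketch using std::thread and std::atomic; the class and member names are placeholders):

#include <atomic>
#include <thread>

class Simulation {
    std::thread worker;
public:
    std::atomic<bool> runthread{ false };

    void start(){                          // spawns the CPU thread running execute()
        runthread = true;
        worker = std::thread(&Simulation::execute, this);
    }
    void stop(){                           // called from main() after user input
        runthread = false;                 // execute() exits after the current kernel
        if (worker.joinable()) worker.join();
    }
    void execute();                        // the function shown above
};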


1 Answer


Execution on the GPU is asynchronous with respect to execution on the CPU. Apart from explicit wait operations, the CPU may continue processing. Also, depending on configuration flags (see cudaSetDeviceFlags), a wait operation may or may not consume CPU cycles:

cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.
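For instance, here is a minimal sketch of selecting the opposite policy, cudaDeviceScheduleBlockingSync, so that the waiting CPU thread sleeps instead of spinning; note the flag must be set before the CUDA context is initialized:

#include "cuda_runtime.h"
#include <stdio.h>

int main()
{
    // Must run before the first kernel launch or any other
    // context-creating runtime call on this device.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess)
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));

    // ... launch kernels as usual; cudaDeviceSynchronize() now puts the
    // CPU thread to sleep instead of burning a core while it waits.
    return 0;
}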

What you want to achieve is entirely feasible (here is an example on Windows):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

volatile int runthread ;

__global__ void kernel() { }

#include <Windows.h>


int execute(void* p)
{
    int count = 0 ;
    while (runthread)
    {
        kernel<<<1,1>>>();
        cudaDeviceSynchronize();
        ++count;
    }
    printf ("Executed kernel %d times\n", count);
    ::ExitThread(count);
    return count ;
}

int main()
{
    runthread = 1 ;

    HANDLE hThread = ::CreateThread (0, 0, execute, 0, 0, 0) ;  // worker runs the kernel loop

    printf ("Press key\n") ;
    int c = getc(stdin);

    printf ("Stopping\n") ;

    runthread = 0 ;

    ::WaitForSingleObject (hThread, INFINITE) ;  // wait for the worker thread to exit

    printf ("DONE\n");
    return 0 ;
}
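The same pattern can be sketched portably with std::thread and std::atomic<bool> instead of the Win32 API (assumes C++11; this variant is not from the original answer):

#include "cuda_runtime.h"

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> runthread{ true };   // safely shared between the two CPU threads

__global__ void kernel() { }

void execute()
{
    int count = 0;
    while (runthread)
    {
        kernel<<<1,1>>>();
        cudaDeviceSynchronize();       // worker thread waits while the GPU runs
        ++count;
    }
    printf("Executed kernel %d times\n", count);
}

int main()
{
    std::thread worker(execute);       // kernel loop on its own CPU thread

    printf("Press key\n");
    getc(stdin);                       // main thread is free to block on input

    runthread = false;                 // worker exits after the current kernel
    worker.join();

    printf("DONE\n");
    return 0;
}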

However, you want to be careful about which thread executes the CUDA calls, as some CUDA configuration and state is stored per thread. If you want to use CUDA from different threads, I recommend this post. In essence, you attach the CUDA context to a thread with the cuCtxSetCurrent API call. The easiest approach is to have all of your CUDA code executed by a single thread.
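A minimal driver-API sketch of that attachment step (the thread functions are hypothetical; only the shared context handle matters):

#include <cuda.h>      // CUDA driver API

CUcontext ctx;         // context shared between the two threads

void threadA_setup()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // context becomes current on this thread
    // ... issue CUDA work from thread A ...
}

void threadB_work()
{
    cuCtxSetCurrent(ctx);        // attach the same context to this thread
    // ... CUDA calls from thread B now target the shared context ...
}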
