CUDA: Modify mapped memory from host while kernel is executing

Question

I want to modify a piece of mapped memory from the host WHILE the kernel is executing, and then read this value from the kernel.

I'm trying to do this the following way. I have a kernel like this:

__global__ void kernel(int* d_ptr)
{
    *d_ptr = 1;
    while( *d_ptr);
}

d_ptr is mapped to a piece of memory accessible from the host.

I also have a host thread that looks like this

void run( void* input )
{
    int* h_ptr = (int*)input;
    while( kernel_running)
        *h_ptr = 0;
}

So, the host thread is repeatedly writing a 0 to a place that the kernel is repeatedly reading from until it sees a 0. Theoretically, the kernel should stop right after it reads the value written by the host thread. The problem is that the kernel never reads this 0, so it never terminates.

The weird part is that if I add a print statement in the kernel like this

__global__ void kernel(int* d_ptr)
{
    *d_ptr = 1;
    while( *d_ptr) printf("%d\n", *d_ptr);
}

Then it does read the 0 and terminates. I have no clue on what's happening. The cuda programming guide does not provide much information about concurrent access to mapped memory and it's been pretty hard to find a question that addresses this either. Any pointers? I'm using Windows and I'm compiling with

nvcc -g -arch=sm_20 -lineinfo

The whole code looks like this:

bool kernel_running = 0;

__global__ void kernel(int* d_ptr)
{
    *d_ptr = 1;
    while( *d_ptr) printf("%d\n", *d_ptr);
}

void run( void* input )
{
    int* h_ptr = (int*)input;
    while( kernel_running)
    {
        *h_ptr = 0;
    }
}

int main()
{
    // HOST AND DEVICE POINTERS
    int* h_ptr = 0;
    int* d_ptr = 0;

    // INITIALIZE POINTERS
    assert( cudaHostAlloc(&h_ptr, sizeof(int), cudaHostAllocMapped) == cudaSuccess);
    assert( cudaHostGetDevicePointer(&d_ptr, h_ptr, 0) == cudaSuccess);

    // RUN KERNEL
    kernel_running = 1;
    _beginthread( run, 0, h_ptr);
    kernel<<<1,1>>>(d_ptr);
    assert( cudaDeviceSynchronize() == cudaSuccess);
    kernel_running = 0;
}

score 0 · Accepted Answer · answered Nov 24 '13 at 06:10

recommendations:

add this before your call to cudaHostAlloc:
```
cudaSetDeviceFlags(cudaDeviceMapHost);
```
the documentation suggests this.

declare your flag variable as volatile:

__global__ void kernel(volatile int* d_ptr)

The following code works for me without the use of threads:

#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void mykernel(volatile int *data){

  while (*data) {};
  printf("finished\n");
}

int main(){

  int *d_data, *h_data;
  cudaSetDeviceFlags(cudaDeviceMapHost);
  cudaCheckErrors("cudaSetDeviceFlags error");
  cudaHostAlloc((void **)&h_data, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc error");
  cudaHostGetDevicePointer(&d_data, h_data, 0);
  cudaCheckErrors("cudaHostGetDevicePointer error");
  *h_data = 1;
  printf("kernel starting\n");
  mykernel<<<1,1>>>(d_data);
  cudaCheckErrors("kernel fail");
  getchar();
  *h_data = 0;
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail 2");
  return 0;
}

Thanks for the answer! I tried your code, but the value of *data in mykernel is 0 when it starts, so the 1 is not copied over for some reason for me... — santi, Nov 24 '13 at 06:20
Ok, I found that the problem I mentioned on my previous comment came from the fact that, since I'm using the WDDM Driver with Windows, the kernel doesn't start until cudaDeviceSynchronize(), at that point, *h_data is zero, so that's why I was reading zero right away from the kernel... I worked around this problem by triggering early launching right after the kernel call as explain here: http://stackoverflow.com/questions/13568805/cuda-kernels-not-launching-before-cudadevicesynchronize — santi, Nov 24 '13 at 07:15
Declaring flag variable as "volatile" worked for me at gtx1050 on compute capabilities 61. On 940mx and compute capabilities 50 - it worked without "volatile" word (visual studio 2022). — TheLV, Aug 25 '22 at 01:57

CUDA: Modify mapped memory from host while kernel is executing

1 Answers1