I want to achieve the effect of the code below: using flags to control kernel behavior from the host. So far, the flags allocated in unified memory work as I expect, but when I update data on the host and copy it to the device, it does not work.
So my question is: can CUDA do this, that is, update data on the host, copy it to the device while a kernel is still executing, and then tell the kernel to process the data by setting a data-ready flag?
More details
cudaMemcpy: When I use cudaMemcpy, the data_ready flag never changes and the kernel keeps printing x.
cudaMemcpyAsync: When I use cudaMemcpyAsync to copy the updated data, the program finishes because data_ready does change, but the value of data remains the same.
Unified memory for data: I also thought about using unified memory for my data, but the data could be really large (more than 1 GB) in a more complex scenario, and I don't think my unified memory could take that.
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
using namespace std;

__global__ void test (int *flag, int *data_ready, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (true) {
        if (*flag == 0) {
            // wait for data transfer
            while (true) {
                if (*data_ready == 0) {
                    printf("x");
                }
                else {
                    break;
                }
            }
            printf("data %d\n", *data);
            __syncthreads();
        }
        else {
            break;
        }
    }
    printf("gpu finish %d\n", tid);
}

int main() {
    // flags
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;
    int *data_ready;
    cudaMallocManaged(&data_ready, sizeof(int));
    *data_ready = 0;

    // data
    int *data = (int *)malloc(sizeof(int));
    int *data_device;
    *data = 777;
    cudaMalloc(&data_device, sizeof(int));
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);

    // launch kernel
    int block = 8, grid = 1;
    test<<<grid, block>>> (flag, data_ready, data_device);

    // random host code
    for (int i = 0; i < 1e5; i++);
    printf("host do something\n");

    // update data
    *data = 987;
    cudaMemcpy(data_device, data, sizeof(int), cudaMemcpyHostToDevice);
    printf("host copied\n");
    *data_ready = 1;

    // update flag
    *flag = 1;
    cudaDeviceSynchronize();

    // free memory
    cudaFree(flag);
    printf("host finish\n");
}
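
For comparison, here is a minimal sketch of the same handshake under different assumptions. It is not a definitive fix. It assumes the symptoms above come from the kernel and the copies sharing the default stream, where a copy is ordered after the running kernel. So it launches the kernel in one non-blocking stream, issues cudaMemcpyAsync in another, and raises the flag only after the copy stream has been synchronized. It also assumes the GPU has a copy engine that can run alongside a kernel, and that a flag in mapped pinned host memory, read through a volatile pointer, is acceptable in place of the managed flag. The names wait_for_data, observe_update, kernel_stream and copy_stream are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical kernel: one thread spins on the host-written flag, then
// reads the payload. volatile keeps the loads from being hoisted or cached.
__global__ void wait_for_data(volatile int *data_ready, volatile int *data,
                              int *observed) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (*data_ready == 0) { /* spin until the host raises the flag */ }
        __threadfence_system();  // keep the payload read after the flag read
        *observed = *data;
    }
}

// Hypothetical helper: returns the value the kernel saw after the update.
int observe_update(int new_value) {
    int *ready_host = nullptr, *ready_dev = nullptr;  // mapped pinned flag
    int *staging = nullptr;                           // pinned staging buffer
    int *data_device = nullptr, *observed_device = nullptr;
    cudaStream_t kernel_stream, copy_stream;

    CHECK(cudaHostAlloc((void **)&ready_host, sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&ready_dev, ready_host, 0));
    CHECK(cudaHostAlloc((void **)&staging, sizeof(int), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&data_device, sizeof(int)));
    CHECK(cudaMalloc((void **)&observed_device, sizeof(int)));
    CHECK(cudaStreamCreateWithFlags(&kernel_stream, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking));

    // Initial state: flag down, old payload on the device.
    *ready_host = 0;
    *staging = 777;
    CHECK(cudaMemcpy(data_device, staging, sizeof(int), cudaMemcpyHostToDevice));

    wait_for_data<<<1, 8, 0, kernel_stream>>>(ready_dev, data_device,
                                              observed_device);
    CHECK(cudaGetLastError());

    // Update the payload in a separate stream while the kernel is waiting.
    *staging = new_value;
    CHECK(cudaMemcpyAsync(data_device, staging, sizeof(int),
                          cudaMemcpyHostToDevice, copy_stream));
    CHECK(cudaStreamSynchronize(copy_stream));  // copy done before the flag
    *(volatile int *)ready_host = 1;            // now release the kernel

    CHECK(cudaStreamSynchronize(kernel_stream));
    int result = 0;
    CHECK(cudaMemcpy(&result, observed_device, sizeof(int),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaStreamDestroy(kernel_stream));
    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaFree(data_device));
    CHECK(cudaFree(observed_device));
    CHECK(cudaFreeHost(staging));
    CHECK(cudaFreeHost(ready_host));
    return result;
}
```

Under those assumptions, calling observe_update(987) from main should return the updated value rather than 777. The payload stays in ordinary device memory allocated with cudaMalloc, so only the one-int flag needs to be host-visible, which sidesteps the concern about putting more than 1 GB in unified memory.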