-1

I am having trouble with the deep copy of an array of structs with dynamically allocated member variables in this cuda code. I think it is occurring because &deviceHistogram points to an address on the host instead of an address on the device. I tried making an intermediate pointer variable (tempData in the code below), but that did not work; how do I properly copy this entire array of structs so I can modify it from the makeHistogram function?

#include <stdlib.h>
#include <stdio.h>
#include "cuda.h"

// One histogram bin: a dynamically allocated array of member values
// plus the number of valid entries currently stored in it.
typedef struct histogramBin {
    int* items;  // separately allocated int array (deep-copy target)
    int count;   // number of valid entries in `items`
} histogramBin;

// Prints bins 0..size-1 of `histogram`, one line per bin.
// An empty bin prints "EMPTY"; otherwise its stored values are printed
// space-separated. Compiled for both host and device callers.
__host__ __device__ void outputHistogram(histogramBin* histogram, int size) {
    for (int bin = 0; bin < size; bin++) {
        printf("%d: ", bin);
        int entries = histogram[bin].count;
        if (entries == 0) {
            printf("EMPTY");
        }
        for (int k = 0; k < entries; k++) {
            printf("%d ", histogram[bin].items[k]);
        }
        printf("\n");
    }
}


// This function embeds PTX code of CUDA to extract bit field from x.
// Returns the `nbits`-wide field of `x` starting at bit `start`,
// zero-extended, via the PTX `bfe.u32` (bit field extract) instruction.
   __device__ uint bfe(uint x, uint start, uint nbits) {
    uint bits;
    // bfe.u32 d, a, b, c  =>  d = bits [b, b+c) of a
    asm("bfe.u32 %0, %1, %2, %3;"
        : "=r"(bits)
        : "r"(x), "r"(start), "r"(nbits));
    return bits;
}

// Single-threaded kernel: for each value in rH, bit `bit` of the value
// selects bin 0 or 1, and the value is appended to that bin.
// NOTE(review): `histogram` is a double pointer, and the launch site
// passes &deviceHistogram — the address of a *host* stack variable.
// Dereferencing *histogram on the device therefore touches host memory,
// which is the out-of-bounds access flagged below.
__global__ void makeHistogram(histogramBin** histogram, int* rH, int rSize, int bit) {
    for (int r = 0; r < rSize; r++) {
        int thisBin = bfe(rH[r], bit, 1);  // bit `bit` of rH[r] picks bin 0 or 1
        int position = (*histogram)[thisBin].count; // **** out of memory access here****
        (*histogram)[thisBin].items[position] = rH[r];
        (*histogram)[thisBin].count++;
    }
}

// Deep-copies the host histogram (n bins, each with a dynamically
// allocated `items` array) to the device, then launches makeHistogram.
// NOTE(review): no cudaMalloc/cudaMemcpy/launch return codes are checked,
// so device-side failures here are silent.
void histogramDriver(histogramBin* histogram, int* rH, int rSize, int bit) {
    int n = 8;  // number of bins; must match the allocation in main
    int* deviceRH;
    histogramBin* deviceHistogram;

    // Copy the input values to the device.
    cudaMalloc((void**)&deviceRH, rSize * sizeof(int));
    cudaMemcpy(deviceRH, rH, rSize * sizeof(int), cudaMemcpyHostToDevice);

    // Shallow copy of the bin array; the embedded `items` pointers still
    // point at host memory after this memcpy and are patched below.
    cudaMalloc((void**)&deviceHistogram, n * sizeof(histogramBin));
    cudaMemcpy(deviceHistogram, histogram, n * sizeof(histogramBin), cudaMemcpyHostToDevice);

    // Host-resident array of device pointers, one items buffer per bin.
    int* tempData[n];
    for (int i = 0; i < n; i++) {
        cudaMalloc(&(tempData[i]), rSize * sizeof(int));
    }
    // Patch each device bin's `items` field to point at its device buffer.
    for (int i = 0; i < n; i++) {
        cudaMemcpy(&(deviceHistogram[i].items), &(tempData[i]), sizeof(int*), cudaMemcpyHostToDevice);
    }
    // Copy each bin's current contents into its device buffer.
    for (int i = 0; i < n; i++) {
        cudaMemcpy(tempData[i], histogram[i].items, rSize * sizeof(int), cudaMemcpyHostToDevice);
    }

    // BUG: &deviceHistogram is the address of a host stack variable, yet the
    // kernel dereferences it on the device. Pass deviceHistogram directly
    // and drop the double pointer from the kernel signature (see answer).
    makeHistogram<<<1, 1>>>(&deviceHistogram, deviceRH, rSize, bit);
    cudaDeviceSynchronize();
}


int main(){
    int rSize = 5;  // NOTE(review): not const, so rH below is a VLA (the answer adds const to compile)
    int rH[rSize] = {1, 2, 3, 4, 5};

    // Host histogram: 8 bins, each pre-allocated to hold up to rSize items.
    histogramBin * histogram = (histogramBin*)malloc(sizeof(histogramBin) * 8);
    for(int i = 0; i < 8; i++){
        histogram[i].items = (int*)calloc(sizeof(int), rSize);  // zero-initialized items buffer
        histogram[i].count = 0;
    }
    histogramDriver(histogram, rH, rSize, 0);
    return 0;  // NOTE(review): histogram/items are never freed (leak at exit)
}

Once it has been copied properly to the device, how do I get it back on the host? For example, if I call outputHistogram(histogram, 5); from inside makeHistogram, I see the following:

0: 2 4 
1: 1 3 5 
2: EMPTY
3: EMPTY
4: EMPTY
5: EMPTY
6: EMPTY
7: EMPTY

Which is the output I am expecting.

When I call outputHistogram(histogram, 8) from histogramDriver (after the cudaDeviceSynchronize()) I see the following:

0: EMPTY
1: EMPTY
2: EMPTY
3: EMPTY
4: EMPTY
5: EMPTY
6: EMPTY
7: EMPTY

Clearly I am not properly copying the values back from the device to the host.

I have tried copying by doing the reverse procedure from the one in histogramDriver:

for(int i = 0; i < n; i++){
    cudaMemcpy(&(tempData[i]), &(deviceHistogram[i].items), sizeof(int*), cudaMemcpyDeviceToHost);
}
for (int i = 0; i < n; i++) {
    cudaMemcpy(histogram[i].items, tempData[i], rSize * sizeof(int), cudaMemcpyDeviceToHost);
}

But the output from the outputHistogram call in histogramDriver remains unchanged.

Luciano
  • 493
  • 5
  • 14
  • 1
    This is a deep copy and there are many questions here on the `cuda` tag that discuss it. [Here](https://stackoverflow.com/a/15435592/1695960) is one answer that lays out the steps and links to several examples. [This](https://stackoverflow.com/questions/30082991/memory-allocation-on-gpu-for-dynamic-array-of-structs) is a follow-up question and answer. [Here](https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824) is another answer that discusses various approaches and links to several examples. Refer to 2D allocation. – Robert Crovella Jul 18 '19 at 01:51
  • 1
    Note item 1 [here](https://stackoverflow.com/help/on-topic) You are supposed to provide a [mcve] What you have shown is not one. It should be a complete code. – Robert Crovella Jul 18 '19 at 01:58
  • 2
    `makeHistogram<<<1, 1>>>(&deviceHistogram, .....` is plainly wrong. What are you hoping to achieve by passing the address of a host variable to the kernel? – talonmies Jul 18 '19 at 01:59
  • @RobertCrovella I have edited to add an MVE. Thank you for the pointers on terminology and the references! – Luciano Jul 18 '19 at 05:11
  • The basic problem here is the design of your kernel, not how you are copying the data. I see no valid reason for `histogramBin** histogram`. Why is it necessary to pass the address of a pointer to that kernel? – talonmies Jul 18 '19 at 08:11
  • @talonmies I don’t know if it’s necessary, but I want to make the histogram and then use the values in it elsewhere; I thought I had to pass an array by reference to do so. – Luciano Jul 18 '19 at 08:15

1 Answer

2

As @talonmies indicated, the biggest problem here is the design of your kernel. There is no reason/need to use a double-pointer for histogram (and indeed, the first iteration of the code you posted did not have that in the kernel prototype, although it was incomplete).

By removing the double-pointer aspect, your code runs without any runtime errors.

#include <stdlib.h>
#include <stdio.h>
#include "cuda.h"

// One histogram bin: a dynamically allocated array of member values
// plus the number of valid entries currently stored in it.
typedef struct histogramBin {
    int* items;  // separately allocated int array (deep-copy target)
    int count;   // number of valid entries in `items`
} histogramBin;

// This function embeds PTX code of CUDA to extract bit field from x.
// Returns the `nbits`-wide field of `x` starting at bit `start`,
// zero-extended, via the PTX `bfe.u32` (bit field extract) instruction.
   __device__ uint bfe(uint x, uint start, uint nbits) {
    uint bits;
    // bfe.u32 d, a, b, c  =>  d = bits [b, b+c) of a
    asm("bfe.u32 %0, %1, %2, %3;"
        : "=r"(bits)
        : "r"(x), "r"(start), "r"(nbits));
    return bits;
}

// Single-threaded kernel (launched <<<1,1>>>): for each value in rH,
// bit `bit` of the value selects bin 0 or 1 and the value is appended
// to that bin's items array. `histogram` is now a plain device pointer —
// the double indirection from the question is gone, so no host address
// is ever dereferenced on the device.
__global__ void makeHistogram(histogramBin* histogram, int* rH, int rSize, int bit) {
    for (int r = 0; r < rSize; r++) {
        int thisBin = bfe(rH[r], bit, 1);  // bit `bit` of rH[r] picks bin 0 or 1
        int position = histogram[thisBin].count; 
        histogram[thisBin].items[position] = rH[r];
        histogram[thisBin].count++;
    }
}

// Deep-copies the host histogram (n bins, each with a dynamically
// allocated `items` array) to the device, then launches makeHistogram
// with the device pointer itself (no address-of — the fix).
void histogramDriver(histogramBin* histogram, int* rH, int rSize, int bit) {
    int n = 8;  // number of bins; must match the allocation in main
    int* deviceRH;
    histogramBin* deviceHistogram;

    // Copy the input values to the device.
    cudaMalloc((void**)&deviceRH, rSize * sizeof(int));
    cudaMemcpy(deviceRH, rH, rSize * sizeof(int), cudaMemcpyHostToDevice);

    // Shallow copy of the bin array; the embedded `items` pointers still
    // point at host memory after this memcpy and are patched below.
    cudaMalloc((void**)&deviceHistogram, n * sizeof(histogramBin));
    cudaMemcpy(deviceHistogram, histogram, n * sizeof(histogramBin), cudaMemcpyHostToDevice);

    // Host-resident array of device pointers, one items buffer per bin.
    int* tempData[n];
    for (int i = 0; i < n; i++) {
        cudaMalloc(&(tempData[i]), rSize * sizeof(int));
    }
    // Patch each device bin's `items` field to point at its device buffer.
    for (int i = 0; i < n; i++) {
        cudaMemcpy(&(deviceHistogram[i].items), &(tempData[i]), sizeof(int*), cudaMemcpyHostToDevice);
    }
    // Copy each bin's current contents into its device buffer.
    for (int i = 0; i < n; i++) {
        cudaMemcpy(tempData[i], histogram[i].items, rSize * sizeof(int), cudaMemcpyHostToDevice);
    }

    // Fixed launch: pass the device pointer directly.
    makeHistogram<<<1, 1>>>(deviceHistogram, deviceRH, rSize, bit);
    cudaDeviceSynchronize();
}


int main(){
    const int rSize = 5;  // const so rH is a regular array, not a VLA
    int rH[rSize] = {1, 2, 3, 4, 5};

    // Host histogram: 8 bins, each pre-allocated to hold up to rSize items.
    histogramBin * histogram = (histogramBin*)malloc(sizeof(histogramBin) * 8);
    for(int i = 0; i < 8; i++){
        histogram[i].items = (int*)calloc(sizeof(int), rSize);  // zero-initialized items buffer
        histogram[i].count = 0;
    }
    histogramDriver(histogram, rH, rSize, 0);
    return 0;
}
$ nvcc t1452.cu -o t1452
$ cuda-memcheck ./t1452
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Note that the only changes here are to the kernel code itself, plus removal of the ampersand on kernel call, plus I added const to the definition of rSize to get things to compile.

I have no idea if it produces correct output, because you've included no way to inspect the output, nor indicated what you expect the output to be. If you are interested in that, those would be good things to include in your MVE.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • Thank you Robert! My apologies for the incomplete MVE; I have edited it to add the expected output as you recommended. – Luciano Jul 23 '19 at 01:11