Crash on cudaMalloc when allocating 2D array

Question

I am trying to build arrays of histograms of unsigned char corresponding to each pixel in an image for the gPb algorithm implementation. I have a crash on a cudaMalloc call which I cannot solve. I have looked through other similar questions and I tested always if the previous operations returned cudaSuccess or not. Here is my code:

First I allocate this structure in constructor of my class CudaImage:

bool CudaImage::create2DHistoArray()
{
    //preparing histograms
    m_LastCudaError = cudaMalloc((void**)&m_dHistograms, (m_Height + 2 * m_Scale) * sizeof(unsigned int*));

    if (m_LastCudaError != cudaSuccess)
        return false;

    //set all histograms to nullptr
    m_LastCudaError = cudaMemset(m_dHistograms, 0, (m_Height + 2 * m_Scale) * sizeof(unsigned int*));

    if (m_LastCudaError != cudaSuccess)
         return false;

    return true;
}

then at some point I would call a member function to allocate some of m_dHistograms[i] as follows:

bool CudaImage::initializeHistoRange(int start, int stop)
{ 
    for (int i = start; i < stop; ++i) {
        m_LastCudaError = cudaMalloc((void**)&m_dHistograms[i], 256 * 2 * m_ArcNo * (m_Width + 2 * m_Scale) * sizeof(unsigned int));
        if (m_LastCudaError != cudaSuccess) {
            return false;
        }

        //set all pixels in the gradient images to 0
        m_LastCudaError = cudaMemset(m_dHistograms[i], 0, 256 * 2 * m_ArcNo * (m_Width + 2 * m_Scale) * sizeof(unsigned int));
        if (m_LastCudaError != cudaSuccess)
            return false;
        }

    return true;
}

The first cudaMalloc in this last function crashes without a single warning. When running with cuda-memcheck I get the following message:

"The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under a host debugger to catch such errors."

Can anyone help ? Another question would be if the array allocation was correctly implemented. I do not want to allocate all memory from the beginning because it will be too much so I allocate in constructor (first function) only the pointers to the rows of the array and then in the application I allocate memory when I need it and free what I do not need.

It is illegal to read or modify the value of `m_dHistograms[i]` in host code because it not allocated in host memory. This question comes up at least once a week on [SO] and there must be at least 50 answers discussing how this must be done — talonmies, Nov 13 '18 at 13:00
@talonmies I have already spent some time looking for the answer on stack overflow. I cannot allocate all the memory from start because it will be too much. — Cristi, Nov 13 '18 at 13:07
https://stackoverflow.com/a/6137517/681865 https://stackoverflow.com/q/15431365/681865 https://stackoverflow.com/q/12924155/681865 All from [frequently asked CUDA questions](https://stackoverflow.com/questions/tagged/cuda?sort=frequent&pageSize=50) — talonmies, Nov 13 '18 at 14:33

talonmies · Accepted Answer · 2018-11-13T15:22:07.610

You are getting segfaults because it is illegal to read or modify the value of m_dHistograms[i] in host code, given it is allocated in device memory. What you need to do is something like this:

bool CudaImage::initializeHistoRange(int start, int stop)
{ 
    for (int i = start; i < stop; ++i) {
        // Allocated memory
        unsigned int* p;
        m_LastCudaError = cudaMalloc((void**)&p, 256 * 2 * m_ArcNo * (m_Width + 2 * m_Scale) * sizeof(unsigned int));
        if (m_LastCudaError != cudaSuccess) {
            return false;
        }

        //set all pixels in the gradient images to 0
        m_LastCudaError = cudaMemset(p, 0, 256 * 2 * m_ArcNo * (m_Width + 2 * m_Scale) * sizeof(unsigned int));
        if (m_LastCudaError != cudaSuccess)
            return false;
        }

        // Transfer address of allocation to device    
        m_LastCudaError = cudaMemcpy(m_dHistograms + i, &p, sizeof(unsigned int *), cudaMemcpyHostToDevice);
        if (m_LastCudaError != cudaSuccess)
            return false;
        }
    return true;
}

[disclaimer: never compiled or run, use at your risk]

Here the allocation address is stored in a host variable which is finally copied to the device array after the allocation and memset operations are done. This incurs the penalty of an additional host to device memory transfer per allocation.

Thank you for your answer. Please bear with me until I test your solution. — Cristi, Nov 13 '18 at 14:38

score 0 · Answer 2 · answered Nov 15 '18 at 09:14

The solution that I found is with the help of this stackoverflow answer. The code is as follows:

bool CudaImage::initializeHistoRange(int start, int stop)
{
    for (int i = start; i < stop; ++i) {
        m_LastCudaError = cudaMalloc((void**)&m_hHistograms[i], 256 * 2 * m_ArcNo * (m_Width + 2 * m_Scale) * sizeof(unsigned int));
        if (m_LastCudaError != cudaSuccess) {
            return false;
        }

        cudaMemcpy(m_dHistograms, m_hHistograms, stop * sizeof(unsigned int*), cudaMemcpyHostToDevice);
        if (m_LastCudaError != cudaSuccess)
            return false;
    }

    return true;
}

bool CudaImage::create2DHistoArray()
{
    m_LastCudaError = cudaMalloc((void**)&m_dHistograms, (m_Height + 2 * m_Scale) * sizeof(unsigned int*));

    if (m_LastCudaError != cudaSuccess)
        return false;

    m_hHistograms = (unsigned int**)malloc((m_Height + 2 * m_Scale) * sizeof(unsigned int*));

    return true;
}

That is I am using an additional member in the host member which helps me to create the memory in the device. The code for freeing memory during the algorithm operation is :

void CudaImage::deleteFromHistoMaps(int index) {

    //I need some more device memory
    if (index + m_Scale + 1 < m_Height + 2 * m_Scale) {
        initializeHistoRange(index + m_Scale + 1, index + m_Scale + 2);
    }

    //device memory is not needed anymore - free it
    if (index >= m_Scale + 1) {
        cudaFree(m_hHistograms[index - m_Scale - 1]);
        m_hHistograms[index - m_Scale - 1] = nullptr;
    }
}

Crash on cudaMalloc when allocating 2D array

2 Answers2