CUDA image does not show output

Question

This is the code to flip image by using CUDA kernel and opencv to read and show image, In main function, the picture of input was shown but the output shows just like black window. By the way, there is no error on the code, it can compile and run but the output looks weid. Below is what I tried so far.

#include< iostream>
#include< cstdio>
#include < opencv2/core.hpp>
#include < opencv2/imgcodecs.hpp>
#include < opencv2/highgui.hpp>
#include< cuda_runtime.h >

using std::cout;
using std::endl;

__global__ void mirror( unsigned char* input, unsigned char* output, int numRows, int numCols)
{
    //2D Index of current thread
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if ( col >= numCols || row >= numRows ) return;

    int thread_x = blockDim.x * blockIdx.x + threadIdx.x;
    int thread_y = blockDim.y * blockIdx.y + threadIdx.y;
    int thread_x_new = numCols-thread_x;
    int thread_y_new = thread_y;
    int mId = thread_y * numCols + thread_x;
    int mId_new = thread_y_new * numCols + thread_x_new;
    output[mId_new] = input[mId]; 
}

 void convert_to_mirror(const cv::Mat& input, cv::Mat& output,int numrows,int numcols)
{
    const dim3 blockSize(1024,1,1);
    int a=numcols/blockSize.x, b=numrows/blockSize.y;   
    const dim3 gridSize(a+1,b+1,1);
    const size_t numPixels = numrows * numcols;
    unsigned char *d_input, *d_output;

    cudaMalloc<unsigned char>(&d_input, numPixels);
    cudaMalloc<unsigned char>(&d_output,numPixels);
    //Copy data from OpenCV input image to device memory
    cudaMemcpy(d_input,input.ptr(), numPixels,cudaMemcpyHostToDevice);
    //Call mirror kernel.
    mirror<<<gridSize, blockSize>>>(d_input,d_output, numrows, numcols);
    cudaDeviceSynchronize(); 
    //copy output from device to host
    cudaMemcpy(output.ptr(), d_output,numPixels, cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_output);
}

int main()
{
    //Read input image from the disk
    cv::Mat input = cv::imread("C:/a.jpg", cv::IMREAD_COLOR);
    const int rows = input.rows;
    const int cols = input.cols;
    if(input.empty())
    {
        std::cout<<"Image Not Found!"<<std::endl;
        std::cin.get();
        return -1;
    }

    //Create output image
    cv::Mat output(rows,cols,CV_8UC3);

    //Call the wrapper function
    convert_to_mirror(input,output,rows,cols);

    //Show the input and output
    cv::imshow("Input",input);
    cv::imshow("Output",output);

    //Wait for key press
    cv::waitKey();
    return 0;
}

Have you tried writing a CPU-only implementation of `mirror<<<>>>` which does not call CUDA at all? I believe that your problem can be boiled down to either incorrect CUDA usage (in that case [MCVE](https://stackoverflow.com/help/mcve) won't contain OpenCV) or incorrect OpenCV usage (in that case MCVE won't contain any CUDA). Or both, in which case you end up with two questions, not one. — yeputons, Dec 15 '18 at 09:22

sgarizvi · Answer 1 · 2020-01-01T07:47:01.273

TLDR: The problem is with the amount of device memory allocated for the image and the indexing scheme used to access the pixel values inside the kernel. Use the corrected implementation from the last code section of this answer.

Following is the explanation of problematic aspects of the provided implementation.

1. Total number of image bytes

The input image is an 8 bit RGB image, so the theoretical number of bytes occupied by it is equal to width x height x number_of_channels. In this case, it should be numRows * numCols * 3. But practically, OpenCV allocates aligned memory for image data, so the total number of image bytes should be calculated as image.step * numrows regardless of image type and number of channels. That being said, the cudaMalloc and cudaMemcpy calls expect total number of bytes we want to allocate or copy respectively. Correct the calls as follows (adapting code from @micehlson's answer):

const size_t numBytes = input.step * numrows;
cudaMalloc<unsigned char>(&d_input, numBytes);
                                    ^
cudaMalloc<unsigned char>(&d_output, numBytes);
                                    ^

//Copy data from OpenCV input image to device memory
cudaMemcpy(d_input, input.ptr(), numBytes, cudaMemcpyHostToDevice);
                                 ^

//copy output from device to host
cudaMemcpy(output.ptr(), d_output, numBytes, cudaMemcpyDeviceToHost);
                                   ^

2. Pixel index calculation in the kernel

Since the image memory is aligned, the actual index of a pixel should be calculated using the step parameter of the Mat object. The generic formula for calculating the start index of a pixel in an OpenCV Mat is as follows:

index = row * step/bytes_per_pixel_component + (channels * column)

For an 8 bit RGB image, the number of bytes occupied by a single component of an RGB pixel is 1 byte. Meaning that an individual R or G or B occupies 1 byte while a whole RGB pixel is 3 bytes. So the starting index is calculated as

int index = row * step + 3 * column;

Since this is the start index, each individual channel of this specific pixel can be accessed by increment this index up-to the number of channels as follows:

int R = index;
int G = index + 1;
int B = index + 2;

Subsequently, the index of the pixel in the flipped image can be calculated as follows (Assuming flip about y axis):

int flipped_index = row * step + 3 * (numCols - column - 1);

Of-course, we would require the image step as an argument to the kernel.

The final kernel may look like this:

__global__ void mirror( unsigned char* input, unsigned char* output, int numRows, int numCols, int channels, int step)
{
    //2D Index of current thread
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;

    if ( col >= numCols || row >= numRows ) return;

    const int tid = row * step + (channels * col);
    const int tid_flipped = row * step + (channels * (numCols - col - 1)); //Flip about y axis

    //Copy each component of the current pixel
    for(int i=0; i<channels; i++)
        output[tid_flipped + i] = input[tid + i]; 
}

Making all the corrections, the final code may look like this:

#include<iostream>
#include<cstdio>
#include<opencv2/core.hpp>
#include<opencv2/imgcodecs.hpp>
#include<opencv2/highgui.hpp>
#include<cuda_runtime.h>

using std::cout;
using std::endl;    

__global__ void mirror( unsigned char* input, unsigned char* output, int numRows, int numCols, int channels, int step)
{
    //2D index of current thread
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;

    if ( col >= numCols || row >= numRows ) return;

    const int tid = row * step + (3 * col);
    const int tid_new = row * step + (3 * (numCols - col - 1)); //Flip about y axis

    //Copy each component of the current pixel
    for(int i=0; i<channels; i++)
        output[tid_new + i] = input[tid + i]; 
}

 void convert_to_mirror(const cv::Mat& input, cv::Mat& output,int numrows,int numcols)
{
    const dim3 blockSize(1024,1,1);

    int a=numcols/blockSize.x, b=numrows/blockSize.y;   

    const dim3 gridSize(a+1,b+1,1);

    const size_t numBytes = input.step * input.rows;

    unsigned char *d_input, *d_output;

    cudaMalloc<unsigned char>(&d_input, numBytes);
    cudaMalloc<unsigned char>(&d_output,numBytes);

    //Copy data from OpenCV input image to device memory
    cudaMemcpy(d_input,input.ptr(), numBytes, cudaMemcpyHostToDevice);

    //Call mirror kernel.
    mirror<<<gridSize, blockSize>>>(d_input,d_output, numrows, numcols, input.channels(), input.step);

    assert(cudaSuccess == cudaDeviceSynchronize()); 

    //copy output from device to host
    cudaMemcpy(output.ptr(), d_output,numBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_input);

    cudaFree(d_output);
}

 int main()
 {
    //Read input image from the disk
    cv::Mat input = cv::imread("C:/a.jpg", cv::IMREAD_COLOR);
    const int rows = input.rows;
    const int cols = input.cols;

    if(input.empty())
    {
        std::cout<<"Image Not Found!"<<std::endl;
        std::cin.get();
        return -1;
    }

    //Create output image
    cv::Mat output(rows,cols,CV_8UC3);

    //Call the wrapper function
    convert_to_mirror(input,output,rows,cols);

    //Show the input and output
    cv::imshow("Input",input);
    cv::imshow("Output",output);

    //Wait for key press
    cv::waitKey();

    return 0;
 }

Compiled with the following command:

nvcc -o mirror -std=c++11 mirror.cu -I/usr/local/include/opencv4 -L/usr/local/lib -lopencv_core -lopencv_imgcodecs -lopencv_highgui

Tested with OpenCV 4.0 and CUDA 9 on Ubuntu 16.04

@JessBrown... I have tested it again and it seems to work fine for 256 x 256 image. Can you share the original image? — sgarizvi, Dec 18 '18 at 05:34
@JessBrown.. Seems fine again. It is highly recommend that you [add CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api/14038590#14038590) in the code. May be the observed behavior is due to some system configuration issue on your machine. Try executing CUDA samples to make sure if the driver/toolkit configuration is correct. I would also suggest that you create a Visual Studio project on windows rather than compiling through command line. — sgarizvi, Dec 18 '18 at 06:40
I also change to use vs2017 with cuda toolkit10.0 and it has lots of errors like these https://imgur.com/mTSeORu — Jess Brown, Dec 18 '18 at 08:00
@JessBrown.. That is a different topic requiring extended discussion. Please go through [this post](http://www.programmerfish.com/running-opencv-in-microsoft-visual-studio/) or any other tutorial describing how to properly setup OpenCV with Visual Studio. — sgarizvi, Dec 18 '18 at 09:24

score 1 · Answer 2 · answered Dec 15 '18 at 10:26

TLDR; OpenCV already has such functionality, also in GPU flavor: cv::cuda::flip and call it like cv::cuda::flip(input, output, 1);

First of all, you are using color image - CV_8UC3 - it means that single pixel is not an unsigned char as you wrote but cv::Vec3b. So it an uchar for each of the R,G,B colors. This requires some tweaks into the code:

__global__ void mirror(unsigned char* input, unsigned char* output, int numRows, int numCols)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;

    if(col >= numCols || row >= numRows) return;

    int mirrorCol = numCols - col;

    int idx = row * numCols * 3 + col * 3;
    int mirrorIdx = row * numCols * 3 + mirrorCol * 3;

    output[mirrorIdx] = input[idx]; //R
    output[mirrorIdx + 1] = input[idx + 1]; //G
    output[mirrorIdx + 2] = input[idx + 2]; //B
}

void convert_to_mirror(const cv::Mat& input, cv::Mat& output, int numrows, int numcols)
{
    const dim3 blockSize(1024, 1, 1);
    int a = numcols / blockSize.x, b = numrows / blockSize.y;
    const dim3 gridSize(a + 1, b + 1, 1);
    const size_t numPixels = numrows * numcols;
    const size_t numBytes = numPixels * 3; // <----- to transfer all channels R,G,B
    unsigned char *d_input, *d_output;

    cudaMalloc<unsigned char>(&d_input, numBytes);  
    cudaMalloc<unsigned char>(&d_output, numBytes); 

    //Copy data from OpenCV input image to device memory
    cudaMemcpy(d_input, input.ptr(), numBytes, cudaMemcpyHostToDevice);

    //Call mirror kernel.
    mirror << <gridSize, blockSize >> > (d_input, d_output, numrows, numcols);
    cudaDeviceSynchronize();
    //copy output from device to host
    cudaMemcpy(output.ptr(), d_output, numBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_input);
    cudaFree(d_output);
}

Also if you want to process images on GPU you may want to look into GpuMat class or for manual image memory access, already encapsulating pixel type - PtrStep

Thank you so much for your help and I do have further question because my output shows like this. [link](https://imgur.com/tEKPYLN) ` What should I change the code. — Jess Brown, Dec 15 '18 at 12:47
Looks like total mashup of all channels, just use `cv::cuda::flip` if you want mirror image. Its probably better optimized and works for all image types. — michelson, Dec 15 '18 at 14:08

CUDA image does not show output

2 Answers2

1. Total number of image bytes

2. Pixel index calculation in the kernel