You have two major problems here:
- nppiBGRToYCbCr420_8u_C3P3R converts a BGR image with interleaved BGR pixel values into one Y image, one Cb image and one Cr image. I.e. the image is output in three separate planes, hence the P in “C3P3”.
- Due to the 4:2:0 coding, the color information is subsampled, meaning the Cb and Cr planes each have only half the width and half the height of the original image (a quarter of the pixels).
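To make the plane sizes concrete, here is a minimal sketch in plain C++ (no NPP needed). The names Planes420, planes420 and packedBytes420 are hypothetical helpers made up for illustration, and the sketch assumes even image dimensions and 8-bit samples:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of NPP): sizes of the three output planes.
struct Planes420 {
    int yWidth, yHeight;  // full-resolution Y (luma) plane
    int cWidth, cHeight;  // size of each of the Cb and Cr planes
};

// Computes the plane sizes for a 4:2:0 image.
// Assumption: width and height are even, so the chroma planes divide exactly.
Planes420 planes420(int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return Planes420{width, height, width / 2, height / 2};
}

// Number of bytes for all three planes when stored tightly packed
// (no row padding), one byte per sample.
std::size_t packedBytes420(const Planes420& p) {
    const std::size_t luma = static_cast<std::size_t>(p.yWidth) * p.yHeight;
    const std::size_t chroma = static_cast<std::size_t>(p.cWidth) * p.cHeight;
    return luma + 2 * chroma;  // Y + Cb + Cr
}
```

For a 640x480 input, planes420(640, 480) gives a 640x480 Y plane and two 320x240 chroma planes, so the three planes together need 1.5 bytes per pixel instead of the 3 bytes per pixel of the interleaved BGR source.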
Allocating the device output planes with nppiMalloc_8u_C1 would then give something like the following (error checking omitted for simplicity; written in the browser and not compiled):
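// Needs <nppi.h>, <cuda_runtime.h> and the OpenCV headers providing Mat, GpuMat and imread.
Mat temp = imread("1.jpg",1);   // flag 1: load as 3-channel BGR
GpuMat BGR(temp);               // upload the interleaved BGR image to the device

// Host buffer for the full-resolution Y plane
unsigned char *host_array = (unsigned char*)malloc(temp.cols * temp.rows * sizeof(unsigned char));
memset(host_array, 0, temp.cols * temp.rows * sizeof(unsigned char));

// nppiMalloc_8u_C1 reports the row pitch (in bytes) through an int*, not a size_t*
int pitchY, pitchCB, pitchCR;
Npp8u *d_arrayY = nppiMalloc_8u_C1(temp.cols, temp.rows, &pitchY);
// 4:2:0: Cb and Cr have half the width and half the height (assumes even dimensions)
Npp8u *d_arrayCB = nppiMalloc_8u_C1(temp.cols/2, temp.rows/2, &pitchCB);
Npp8u *d_arrayCR = nppiMalloc_8u_C1(temp.cols/2, temp.rows/2, &pitchCR);

int Dstep[3] = {pitchY, pitchCB, pitchCR};
Npp8u* d_ptrs[3] = {d_arrayY, d_arrayCB, d_arrayCR};
NppiSize ds;
ds.height = temp.rows;
ds.width = temp.cols;

// GpuMat::step is a size_t, the NPP call expects an int source step
nppiBGRToYCbCr420_8u_C3P3R(BGR.ptr<Npp8u>(), (int)BGR.step, d_ptrs, Dstep, ds);

// Copy only the Y plane back; the source rows are pitchY bytes apart, the host rows are tightly packed
cudaMemcpy2D(host_array, temp.cols, d_arrayY, pitchY, temp.cols * sizeof(Npp8u), temp.rows, cudaMemcpyDeviceToHost);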
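The cudaMemcpy2D call above is needed because the pitch returned by nppiMalloc_8u_C1 may be larger than the image width, so the device rows are not contiguous. As a rough host-only illustration of what such a pitched copy does, here is a sketch in plain C++; packPlane is a hypothetical helper written for this answer, not a CUDA or NPP function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a plane whose rows are srcPitch bytes apart
// into a tightly packed buffer (row stride == widthBytes), row by row.
// This mirrors the row-wise behaviour of a 2D pitched copy, on the host only.
std::vector<unsigned char> packPlane(const unsigned char* src,
                                     std::size_t srcPitch,
                                     std::size_t widthBytes,
                                     std::size_t height) {
    assert(widthBytes <= srcPitch);  // padding may only follow the pixel data
    std::vector<unsigned char> packed(widthBytes * height);
    for (std::size_t row = 0; row < height; ++row) {
        // Skip the padding bytes at the end of each source row.
        std::memcpy(packed.data() + row * widthBytes,
                    src + row * srcPitch,
                    widthBytes);
    }
    return packed;
}
```

The Cb and Cr planes would be fetched the same way as the Y plane, but with temp.cols/2 as the row width, temp.rows/2 as the row count and pitchCB or pitchCR as the source pitch, into host buffers of the matching (quarter) size.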