I am new to CUDA and I am trying a very simple task myself -- copying memory to the graphics card and then copying it back. Here is a simplified version of the code that I wrote.

const int arraySize = 100; 

int* data1 = NULL; 
int* data2 = NULL;
// Initialized both data1 and data2
// ... 

int* dev_data1 = NULL; 
int* dev_data2 = NULL; 
// Initialized both dev_data1 and dev_data2
// ... 

// copy data1 to device
cudaMemcpy(dev_data1, data1, arraySize*sizeof(int), cudaMemcpyHostToDevice );

// copy dev_data1 to dev_data2 with gpu
gpucopy<<<1, arraySize>>>( dev_data1, dev_data2 );

// copy dev_data2 to data2
cudaMemcpy(data2, dev_data2, arraySize*sizeof(int), cudaMemcpyDeviceToHost );

And gpucopy is as follows:

__global__ void gpucopy( int* src, int* dst )
{
    int i = threadIdx.x;
    dst[i] = src[i];
} 

I found that if arraySize is small, the above function works. But if arraySize reaches a specific size, data2 becomes all zeros. My guess is that there is some sort of limitation when running GPU functions, but is there a way to find that out? If I have a very big array, how can I copy it to the GPU (and back)?

Yuchen
  • Please post a full compilable sample! With the snippets you've posted it's not possible to help you with your problem about the 0's. Also please do [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) – hubs Feb 14 '14 at 22:16
  • How large is "a specific size"? – maxywb Feb 14 '14 at 22:28
  • @maxywb. Just tested on my machine. The specific size is 512. But based on the following answers, this number should vary between different machines. – Yuchen Feb 14 '14 at 22:34
  • The code as posted would have a kernel launch failure (resulting in no modified output) at either 513 (and above) or 1025 (and above) for `arraySize`, depending on the specific GPU and compilation commands being passed. – Robert Crovella Feb 14 '14 at 22:37

2 Answers


First of all, you should be doing proper cuda error checking.
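
For example, a minimal sketch of such checks, reusing the variables from the question, might look like this:

cudaError_t err;

err = cudaMemcpy(dev_data1, data1, arraySize*sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("cudaMemcpy H2D failed: %s\n", cudaGetErrorString(err));

gpucopy<<<1, arraySize>>>( dev_data1, dev_data2 );
err = cudaGetLastError();          // reports invalid launch configurations
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();     // reports errors that occur while the kernel runs
if (err != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(err));

With these checks in place, an oversized block size shows up as an explicit error instead of silently unmodified output.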

Secondly, you may be under the impression that threadIdx.x gives a globally unique thread ID. It does not.

So modify your kernel line from:

int i = threadIdx.x;

to:

int i = blockIdx.x * blockDim.x + threadIdx.x;

Finally, your threads per block parameter cannot exceed 512 or 1024 depending on the GPU. We launch larger grids by specifying both a threads per block and blocks per grid parameter:

#define nTPB 512
gpucopy<<<(arraySize + nTPB - 1)/nTPB, nTPB>>>( dev_data1, dev_data2 );

Coupled with this grid sizing approach, we usually include a thread check in the kernel to prevent out-of-bounds accesses for arbitrary grid/problem sizes:

__global__ void gpucopy( int* src, int* dst, int size )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < size)
      dst[i] = src[i];
}

And likewise we have to tell that kernel the problem size:

gpucopy<<<(arraySize + nTPB - 1)/nTPB, nTPB>>>( dev_data1, dev_data2, arraySize );

You might want to review the CUDA programming guide.
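
For reference, a minimal sketch that puts these pieces together (with an illustrative array size and a simple error-check helper) might look like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void gpucopy( const int* src, int* dst, int size )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < size)                 // guard against extra threads in the last block
        dst[i] = src[i];
}

// illustrative helper: abort with a message on any CUDA error
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

int main()
{
    const int arraySize = 100000;   // much larger than one block can hold
    const int nTPB = 512;           // threads per block

    int* data1 = new int[arraySize];
    int* data2 = new int[arraySize];
    for (int i = 0; i < arraySize; i++) data1[i] = 2*i;

    int *dev_data1 = NULL, *dev_data2 = NULL;
    check(cudaMalloc(&dev_data1, arraySize*sizeof(int)), "cudaMalloc dev_data1");
    check(cudaMalloc(&dev_data2, arraySize*sizeof(int)), "cudaMalloc dev_data2");

    check(cudaMemcpy(dev_data1, data1, arraySize*sizeof(int), cudaMemcpyHostToDevice),
          "cudaMemcpy to device");

    // enough blocks to cover arraySize, rounded up
    gpucopy<<<(arraySize + nTPB - 1)/nTPB, nTPB>>>(dev_data1, dev_data2, arraySize);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(data2, dev_data2, arraySize*sizeof(int), cudaMemcpyDeviceToHost),
          "cudaMemcpy to host");

    for (int i = 0; i < arraySize; i++)
        if (data2[i] != data1[i])
            printf("mismatch at %d: %d != %d\n", i, data2[i], data1[i]);

    cudaFree(dev_data1); cudaFree(dev_data2);
    delete[] data1; delete[] data2;
    return 0;
}

The 512 threads per block is just an illustrative choice that stays within the limits mentioned above; the block count scales with arraySize.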

Robert Crovella

The maximum threadIdx value (i.e. the maximum number of threads per block) is architecture-dependent, and you cannot just set it to whatever you like.

The following code works up to arraySize == 1024 on my system, but at arraySize == 1025 I get undefined values:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
using namespace std;

__global__ void gpucopy( int* src, int* dst )
{
  int i = threadIdx.x;
  dst[i] = src[i];
} 

int main()
{
  const int arraySize = 500; // >= 1025 will fail on my system!

  int* data1 = new int[arraySize]; 
  int* data2 = new int[arraySize];
  // Initialized both data1 and data2
  // ... 
  for(int i=0; i<arraySize; i++)
    data1[i] = 2*i;

  int* dev_data1 = NULL; 
  int* dev_data2 = NULL; 
  // Initialized both dev_data1 and dev_data2
  // ... 
  cudaMalloc(&dev_data1, arraySize*sizeof(int));
  cudaMalloc(&dev_data2, arraySize*sizeof(int));

  // copy data1 to device
  cudaMemcpy(dev_data1, data1, arraySize*sizeof(int), cudaMemcpyHostToDevice );

  // copy dev_data1 to dev_data2 with gpu
  gpucopy<<<1, arraySize>>>( dev_data1, dev_data2 ); 

  // copy dev_data2 to data
  cudaMemcpy(data2, dev_data2, arraySize*sizeof(int), cudaMemcpyDeviceToHost );


  for(int i=0; i<arraySize; i++)
    if(data2[i] != data1[i])
      cout << "Error: data is different - data2[" << i << "] is " << data2[i] << endl;

  return 0;
}

You can find out this value either by looking at the documentation or by running the deviceQuery sample application:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2688 MBytes (2818572288 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock Speed:                               1.15 GHz
  Memory Clock rate:                             1500.00 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024 <-----
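
You can also query the limit at runtime with cudaGetDeviceProperties; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // query device 0; maxThreadsPerBlock is the limit highlighted above
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("%s: max threads per block = %d\n", prop.name, prop.maxThreadsPerBlock);
    return 0;
}

This prints the same value highlighted in the deviceQuery output above.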
Marco A.
  • I think you'll get an error with your kernel call, because depending on compute capability it's only possible to launch up to 1024 threads per block. When you use error checking you'll see that. – hubs Feb 14 '14 at 22:23
  • That's correct; in fact I can only launch it with <= 1024 threads. 1025 will fail. – Marco A. Feb 14 '14 at 22:25