7

I have written some code to swap the quadrants of a 2D matrix, stored in a flat array, for FFT purposes.

    int leftover = W - dcW;

    T *temp;
    T *topHalf;
    cudaMalloc((void **)&temp, dcW * sizeof(T));

    // swap the left and right halves of every row
    for (int i = 0; i < H; i++)
    {
        cudaMemcpy(temp, &data[i*W], dcW*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W], &data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice);
    }

    // swap the top and bottom halves
    cudaMalloc((void **)&topHalf, dcH*W*sizeof(T));
    leftover = H - dcH;
    cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);

    cudaFree(temp);
    cudaFree(topHalf);

Notice that this code takes device pointers and performs DeviceToDevice transfers.

Why does this run so slowly? Can it be optimized somehow? I timed it against the same operation on the host using regular memcpy, and the device version was about 2x slower.

Any ideas?

Derek
  • Launching cudaMemcpy is costly. You are better off writing a kernel that reads the input, swaps, and writes to the appropriate location than putting cudaMemcpy in a for loop. – Pavan Yalamanchili May 20 '11 at 01:20
  • hrmmm..bummer. What about the comparison of doing a host memcpy, and transferring to device? – Derek May 20 '11 at 15:05

2 Answers

9

I ended up writing a kernel to do the swaps. This was indeed faster than the DeviceToDevice memcpy operations.

Derek
3

Perhaps the following solution to perform the 2d fftshift in CUDA would be of interest:

#define IDX2R(i,j,N) (((i)*(N))+(j))

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.y + blockDim.y * blockIdx.y;
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        data[IDX2R(i,j,N2)].x *= a;
        data[IDX2R(i,j,N2)].y *= a;
    }
}

It consists of multiplying the matrix to be transformed by a checkerboard of 1s and -1s, which is equivalent to multiplication by exp(-j*(n+m)*pi) and thus to a shift in both directions in the conjugate domain.
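The identity behind this checkerboard trick is the DFT shift theorem. Writing it out for one dimension (the 2D case factors into two such shifts along n and m):

    sum_n x[n] * (-1)^n * e^(-j*2*pi*k*n/N)
      = sum_n x[n] * e^(-j*2*pi*(k - N/2)*n/N)
      = X[k - N/2]

since (-1)^n = e^(j*pi*n) = e^(j*2*pi*(N/2)*n/N), so the sign flip moves the spectrum by exactly half the period.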

You have to call this kernel before and after applying the cuFFT transform.

One advantage is that memory movements/swaps are avoided entirely.

IMPROVEMENT IN SPEED

Following a suggestion received on the NVIDIA Forum, improved speed can be achieved by changing the instruction

double a = pow(-1.0,(i+j)&1);

to

double a = 1-2*((i+j)&1);

to avoid the use of the slow routine pow.

Vitality
  • Indeed in almost all filtering applications this step can be removed by keeping all filters in the wrapped fft space. – Mikhail Jun 23 '14 at 06:45