
I wrote my first CUDA program and I am wondering how it gets parallelized. I have the variables r0_dev and f_dev, arrays of length (*fnum_dev) * 3 each; each block reads them sequentially. Then there are r_dev, which I read, and v_dev, which I want to write to in parallel; both are arrays of length gnum * 3.

The program produces the results I want, but its complexity (execution time as a function of data size) is not what I would have expected.

My expectation was that, as the size of the array v_dev increases, the execution time stays constant, as long as gnum is smaller than the maximum number of blocks allowed in one grid dimension.

Reality is different. I measured the time with the following code and observed linear complexity, which is what I would have expected from sequential code.

dim3 blockGrid(gnum);

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// the actual calculation
stokeslets<<<blockGrid, 1>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev);

// time measurement
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
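
To make sure the measured time belongs to a kernel that actually ran, one could additionally check for errors after the launch, for example like this (a minimal sketch using cudaGetLastError and cudaDeviceSynchronize; the plain printf error reporting is just for illustration and requires <cstdio>):

// sketch: verify that the kernel launched and ran without error,
// so a silent failure cannot masquerade as a fast run
cudaError_t err = cudaGetLastError();              // launch-configuration errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                     // errors raised during execution
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));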

Question:

Is my expectation, described above, wrong? What additional considerations are important?

Details

The following shows the implementation of stokeslets. Maybe I'm doing something wrong there?

__device__ void gridPoint(int offset, float* r0, float* f, float* r, float* v) {
    // contribution of force point `offset` to the velocity at grid point r
    int flatInd = 3 * offset;

    float dr[3];    // vector from the force point to the grid point
    float len = 0;  // squared distance, turned into the distance by the sqrt below
    float drf = 0;  // dot product dr . f

    for (int i = 0; i < 3; i++) {
        dr[i] = r[i] - r0[i + flatInd];
        len += dr[i] * dr[i];
        drf += dr[i] * f[i + flatInd];
    }

    len = sqrt(len);
    float fak = 1 / (8 * 3.1416 * 0.7);  // prefactor 1 / (8 * pi * mu) with mu = 0.7
    v[0] += (fak / len) * (f[0 + flatInd] + (dr[0]) * drf / (len * len));
    v[1] += (fak / len) * (f[1 + flatInd] + (dr[1]) * drf / (len * len));
    v[2] += (fak / len) * (f[2 + flatInd] + (dr[2]) * drf / (len * len));
}


__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v) {
    // where are we (which block, which is equivalent to the grid point)?
    int idx = blockIdx.x;

    // we want to add all force contributions
    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};

    float vh[3] = {0, 0, 0};
    for (int i=0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }
    // sum intermediate velocity vh
    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}
  • What range of gnum values did you test with? Timer precision may be an issue for short-running kernels; what was the longest time you measured? – Heatsink Mar 23 '14 at 17:10
  • The values for time were between 100 and 3000; that's milliseconds, I think, and the precision should be around 0.5 milliseconds. – spiehr Mar 23 '14 at 18:30
  • You are running multiple blocks containing only one thread each. Remember that the threads in a block are organized in warps of `32` threads and executed concurrently. In your case, you are wasting the GPU's parallelism since you are running warps with only one thread. I would recommend rewriting your code/kernel to enable thread concurrency. – Vitality Mar 23 '14 at 21:43
  • 1
  • If you are new to CUDA, for more information on thread organization and scheduling I would suggest going through: [Streaming multiprocessors, Blocks and Threads (CUDA)](http://stackoverflow.com/questions/3519598/streaming-multiprocessors-blocks-and-threads-cuda), [How CUDA Blocks/Warps/Threads map onto CUDA Cores?](http://stackoverflow.com/questions/10460742/how-cuda-blocks-warps-threads-map-onto-cuda-cores) and [the Fermi architecture whitepaper](http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf). – Vitality Mar 23 '14 at 21:44
  • @JackOLantern when you write an answer, I'll accept it :). The word warp is new to me. Probably my view of parallelism was far too naive... – spiehr Mar 23 '14 at 21:45

1 Answer


The main problem with your code is that you are running multiple blocks, each containing only one thread.

Quoting the CUDA C Programming Guide

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently.

Quoting the answer to the post How CUDA Blocks/Warps/Threads map onto CUDA Cores?

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and **threads are divided into groups of 32 threads called warps**. Once a warp is allocated it is called an active warp. **The two warp schedulers pick two active warps per cycle and dispatch warps to execution units.**

From the two bold-marked sentences it follows that, since each of your blocks contains a single thread, every warp carries only one active thread, and the two warp schedulers can therefore dispatch no more than 2 useful threads per clock cycle per streaming multiprocessor. This is the main reason why you are observing essentially the same computational complexity as for the sequential case.

The recommendation is to rewrite your code/kernel so that each block runs multiple threads, for instance along the lines of the sketch below.
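
A minimal sketch of what that could look like (untested; the extra gnum argument and the block size of 256 are my additions, not part of the original code):

__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v, int gnum) {
    // one thread per grid point: combine block and thread indices into a global index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= gnum) return;  // guard against the last, partially filled block

    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};
    float vh[3] = {0, 0, 0};
    for (int i = 0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }

    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}

// launch enough blocks of 256 threads to cover all gnum grid points
int threadsPerBlock = 256;
int numBlocks = (gnum + threadsPerBlock - 1) / threadsPerBlock;
stokeslets<<<numBlocks, threadsPerBlock>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev, gnum);

With this structure every warp carries 32 active threads instead of 1, so the hardware parallelism described above is actually exploited; the runtime should then stay roughly flat until the GPU is saturated and only grow with gnum beyond that point.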

Further reading: the Fermi architecture whitepaper.
