I wrote my first cuda script and wonder about how it is parallelized.
I have some variables r0_dev
and f_dev
, arrays of length (*fnum_dev) * 3
each. On each block, they are read sequentially. Then there are r_dev
, which I read, and v_dev
which I want to write at in parallel, both are arrays of length gnum * 3
.
The program produces the results that I want it to produce, but the complexity (function of time with respect to data size) is not what I would have expected.
My expectation is, that, when the size of the array v_dev
increases, the execution time stays constant, for values of gnum
smaller than the number of blocks allowed in some dimension.
Reality is different. With the following code, the time was measured. A linear complexity is observed, what I would have expexted in sequential code.
dim3 blockGrid(gnum);
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// the actual calculation
stokeslets<<<blockGrid, 1>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev);
// time measurement
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
Question:
Is my expectation, described above, wrong? What additional considerrations are important?
Details
The following shows the implementation of stokeslet
. Maybe I'm doing something bad there?
__device__ void gridPoint(int offset, float* r0, float* f, float* r, float* v) {
int flatInd = 3 * offset;
float dr[3];
float len = 0;
float drf = 0;
for (int i = 0; i < 3; i++) {
dr[i] = r[i] - r0[i + flatInd];
len += dr[i] * dr[i];
drf += dr[i] * f[i + flatInd];
}
len = sqrt(len);
float fak = 1 / (8 * 3.1416 * 0.7);
v[0] += (fak / len) * (f[0 + flatInd] + (dr[0]) * drf / (len * len));
v[1] += (fak / len) * (f[1 + flatInd] + (dr[1]) * drf / (len * len));
v[2] += (fak / len) * (f[2 + flatInd] + (dr[2]) * drf / (len * len));
}
__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v) {
// where are we (which block, which is equivalent to the grid point)?
int idx = blockIdx.x;
// we want to add all force contributions
float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};
float vh[3] = {0, 0, 0};
for (int i=0; i < *fnum; i++) {
gridPoint(i, r0, f, rh, vh);
}
// sum intermediate velocity vh
int flatInd = 3 * idx;
v[0 + flatInd] += vh[0];
v[1 + flatInd] += vh[1];
v[2 + flatInd] += vh[2];
}