8

I am just about to embark on converting a program I wrote into CUDA to hopefully increase processing speed.

Now obviously my old program executes many functions one after the other, and I have separated these functions in my main program and call each one in order.

int main()
{
  /* initialization of variables */
  function1();
  function2();
  function3();
  /* print result */
  return 0;
}

These functions are inherently serial, as function2 depends on the results of function1.

Alright, so now I want to convert these functions into kernels, and run the tasks in the functions in parallel.

Is it as simple as rewriting each function in a parallel way and then, in my main program, calling each kernel one after the other? Is this slower than it needs to be? For example, can I have my GPU directly execute the next parallel operation without going back to the CPU to initialize the next kernel?

Obviously I will keep all runtime variables in GPU memory to limit the amount of data transfer going on, so should I even worry about the time it takes between kernel calls?

I hope this question is clear, if not please ask me to elaborate. Thanks.

And here is an extra question so that I can check my sanity. Ultimately this program's input is a video file, and through the different functions, each frame will lead to a result. My plan is to grab multiple frames at a time (say 8 unique frames) and then divide the total number of blocks I have among these 8 frames, and then the multiple threads in the blocks will be doing even more parallel operations on the image data, such as vector addition, Fourier transforms, etc.
Is this the right way to approach the problem?
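
To make that concrete, the mapping I have in mind looks roughly like this (the kernel name, frame size, and launch configuration are just placeholders for illustration):

#define FRAMES_PER_BATCH 8
#define FRAME_WIDTH  128
#define FRAME_HEIGHT 128

// Sketch: blockIdx.y picks one of the 8 frames, the x dimension covers its pixels.
__global__ void process_batch(const unsigned char* frames, float* results)
{
    int frame = blockIdx.y;
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;

    if (pixel < FRAME_WIDTH * FRAME_HEIGHT)
    {
        const unsigned char* my_frame = frames + frame * FRAME_WIDTH * FRAME_HEIGHT;
        // ... vector additions, Fourier transforms, etc. on my_frame,
        //     eventually producing something in results[frame] ...
    }
}

// launch: one y-slice of the grid per frame
// dim3 grid((FRAME_WIDTH * FRAME_HEIGHT + 255) / 256, FRAMES_PER_BATCH);
// process_batch<<<grid, 256>>>(d_frames, d_results);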

Shawn Tabrizi

3 Answers

6

There are some cases where you can get programs to run at the full potential speed on the GPU with very little porting work from a plain CPU version, and this might be one of them.

If it's possible for you to have a function like this:

void process_single_video_frame(void* part_of_frame)
{
  // initialize variables
  ...
  intermediate_result_1 = function1(part_of_frame);
  intermediate_result_2 = function2(intermediate_result_1);
  intermediate_result_3 = function3(intermediate_result_2);
  store_results(intermediate_result_3);
}

and you can process many part_of_frames at the same time (say, a few thousand),

and function1(), function2() and function3() go through pretty much the same code paths (that is, the program flow does not depend heavily on the contents of the frame),

then, local memory may do all the work for you. Local memory is a per-thread memory space that is physically stored in global memory. It is different from ordinary global memory in a subtle, yet profound way: the memory is interleaved in such a way that adjacent threads access adjacent 32-bit words, enabling the memory access to be fully coalesced if the threads all read from the same location of their local memory arrays.

The flow of your program would be that you start out by copying part_of_frame to a local array and prepare other local arrays for intermediate results. You then pass pointers to the local arrays between the various functions in your code.

Some pseudocode:

const int size_of_one_frame_part = 1000;

__device__ void function1(int* dst, int* src);  // forward declaration

__global__ void my_kernel(int* all_parts_of_frames) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // per-thread arrays like these live in local memory
    int my_local_array[size_of_one_frame_part];
    memcpy(my_local_array,
           all_parts_of_frames + i * size_of_one_frame_part,
           size_of_one_frame_part * sizeof(int));
    int local_intermediate_1[100];
    function1(local_intermediate_1, my_local_array);
    ...
}

__device__ void function1(int* dst, int* src) {
   ...
}

In summary, this approach may let you use your CPU functions pretty much unchanged, as the parallelism does not come from creating parallelized versions of your functions, but from running the entire chain of functions in parallel. And this, again, is made possible by the hardware support for interleaving the memory in local arrays.

Notes:

  • The initial copy of the part_of_frame from global to local memory is not coalesced, but hopefully, you will have enough calculations to hide that.

  • On devices of compute capability <= 1.3, there is only 16 KiB of local memory available per thread, which may not be enough for your part_of_frame and the other intermediate data. But on compute capability >= 2.0, this has been expanded to 512 KiB, which should be plenty.

Roger Dahl
  • This is a very interesting approach, however if I understand your method correctly I do not think I can do this. I first start with a 512x512 raw video file. From that, there is a 128x128 section I analyze and do all the functions on. From that 128x128 section, certain tasks are dependent on the entire image, so I would not be able to independently take a 16x16 square, do manipulations on that, and then just add that in at the end. – Shawn Tabrizi Jul 19 '12 at 00:52
  • Okay, after rereading your idea, I think it might work, but I want to clarify to make sure I am not taking your idea out of proportion. My C program currently runs on 1 frame, using a single CPU thread (as far as I can tell). Would I be able to then pass #MAXBLOCKS*#MAXTHREADS number of images to my program, and have each thread go through my C program once on its frame? Which means that it will be slow to get a result, but I would get like 60,000 frames processed in one go? As in, use int i = blockIdx.x * blockDim.x + threadIdx.x; to determine which image to process, and then crunch it. – Shawn Tabrizi Jul 19 '12 at 01:39
  • Also, should I be worried about the sheer magnitude of runtime variables and memory allocation at this scale? Assuming I process ~60,000 images, each on one thread, this alone is around 1 GB of data (128x128 8-bit images). Then I will also be making 60,000 copies of all the runtime variables, and allocating them all in the kernel, which I have read is not a good idea. – Shawn Tabrizi Jul 19 '12 at 01:45
  • I would do the frames in smaller batches so that host-to-device and device-to-host copying can run concurrently with the kernel. Remember, you don't have to schedule all 60,000 frames at the same time -- you only have to schedule enough to saturate the GPU, which might mean a couple of thousand on a high-end 2.0 device. Even if you do schedule all 60,000, those threads won't all run concurrently. The number of concurrent threads is limited by various GPU resources, such as the number of registers. – Roger Dahl Jul 19 '12 at 03:37
  • So, it would go something like this: Copy the first batch of 2000 frames to the device. Start kernel on first batch. Copy the second batch of 2000 frames to the GPU. Wait for first batch to complete. Start kernel on second batch. Copy results from first batch to host. Wait for second batch to complete. Repeat. – Roger Dahl Jul 19 '12 at 03:42
  • Your advice is awesome. Is there somewhere I can learn or read up on making sure the copying of images runs concurrently with the processing of the data? For some reason, I can't imagine doing these two processes not on a stack. – Shawn Tabrizi Jul 19 '12 at 17:21
  • You can use streams. See for instance `3.2.5 Asynchronous Concurrent Execution` and `3.2.5.5.5 Overlapping Behavior` in the `CUDA C Programming Guide 4.2.`. – Roger Dahl Jul 19 '12 at 17:37
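
A minimal sketch of the batched, double-buffered flow described in the comments above (the buffer names, sizes, and the process_batch kernel are placeholders; the host buffers are assumed to be pinned with cudaHostAlloc so the copies can actually be asynchronous):

cudaStream_t stream[2];
for (int s = 0; s < 2; ++s)
    cudaStreamCreate(&stream[s]);

for (int batch = 0; batch < num_batches; ++batch) {
    int s = batch % 2;   // alternate between the two streams / device buffers

    // copy the next batch of frames to the device
    cudaMemcpyAsync(d_frames[s], h_frames + batch * batch_bytes,
                    batch_bytes, cudaMemcpyHostToDevice, stream[s]);

    // process it; this can overlap with the copies issued on the other stream
    process_batch<<<grid, block, 0, stream[s]>>>(d_frames[s], d_results[s]);

    // copy this batch's results back
    cudaMemcpyAsync(h_results + batch * result_bytes, d_results[s],
                    result_bytes, cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();   // wait for all batches to finish
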
5

Answering some of your questions:

Calling a kernel isn't that expensive, so don't be afraid of the program flow returning from the GPU to the CPU. As long as you keep your results in GPU memory, there won't be much overhead. If you want to, you can make a kernel that simply calls other device functions in a sequence. AFAIK this will be harder to debug and profile; I am not sure whether one can even profile functions called by a kernel.
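
For example, a launch sequence where the intermediate results never leave the device could look roughly like this (the kernels and buffer sizes are invented for illustration):

float *d_input, *d_tmp1, *d_tmp2, *d_result;
cudaMalloc((void**)&d_input,  input_bytes);
cudaMalloc((void**)&d_tmp1,   tmp_bytes);
cudaMalloc((void**)&d_tmp2,   tmp_bytes);
cudaMalloc((void**)&d_result, result_bytes);

cudaMemcpy(d_input, h_input, input_bytes, cudaMemcpyHostToDevice);

// the launches queue up and run one after the other on the default stream
function1_kernel<<<grid, block>>>(d_tmp1, d_input);
function2_kernel<<<grid, block>>>(d_tmp2, d_tmp1);
function3_kernel<<<grid, block>>>(d_result, d_tmp2);

// only the final result is copied back; this call also waits for the kernels
cudaMemcpy(h_result, d_result, result_bytes, cudaMemcpyDeviceToHost);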

Regarding parallelization:

I think any idea that allows you to run computation on multiple data streams is good. The more your code resembles a shader, the better (meaning it has the characteristics required to run fast on a GPU). The idea with multiple frames is nice. Some hints about it: minimize synchronization as much as possible, access memory as rarely as possible, try to increase the ratio of computation time to I/O time, make use of GPU registers / shared memory, and prefer many-read-from-one over one-writes-to-many designs.
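
For instance, the registers / shared memory hint in practice might look roughly like this (a generic sketch with made-up names and sizes, not specific to the functions in the question): each block stages the piece of data it needs once, and every further read comes from the fast on-chip copy.

#define TILE 256   // must match blockDim.x at launch

__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];   // one coalesced global read per element
    __syncthreads();                 // the only synchronization in the kernel

    // neighbours are read from shared memory instead of global memory
    // (threads at the tile edges are skipped just to keep the sketch short)
    if (threadIdx.x > 0 && threadIdx.x < TILE - 1 && i < n - 1)
        out[i] = 0.25f * tile[threadIdx.x - 1]
               + 0.50f * tile[threadIdx.x]
               + 0.25f * tile[threadIdx.x + 1];
}

// launch: smooth<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);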

emesx
  • Thank you, almost everything you said was perfectly clear :] How about the synchronization thing though, what do you mean by that, and how would I minimize it? – Shawn Tabrizi Jul 19 '12 at 01:00
  • It's possible to instrument the code to select a section to profile. The calls are cudaProfilerStart() and cudaProfilerStop(). With these, you can focus on a single device function or some section of code at a time. – Roger Dahl Jul 19 '12 at 04:10
  • Good tips. I would just add, don't optimize prematurely. Start with a simple implementation just to see what the compiler and GPU are able to do with it and take it from there. – Roger Dahl Jul 19 '12 at 04:14
  • @user1396977 In other words, the less the threads need to communicate with each other, the better. For example, in image processing you can process fragments of images in separate threads and there are (usually) no intermediate results. In algorithms with an iterative character, like neural networks, synchronization is necessary, e.g. to ensure that at the start of each iteration all threads have completed the previous run and have updated the state of all variables. – emesx Jul 19 '12 at 14:42
1

If the GPU resources are enough to handle the 3 functions in a single kernel, then you can either put your functions into one big kernel, or you can serially launch 3 kernels to run the functions separately. There is little difference in terms of performance, because a kernel launch has negligible hardware overhead and low software overhead.

However, if the GPU resources are not enough, putting the 3 functions into a single kernel could sacrifice performance. In this case, it's better to put each function into a separate kernel.

chaohuang
  • It is true that there is very little overhead in launching a kernel. But each kernel launch means that you have to read the source data from global memory and write the results back to global memory. If you can chain function calls together in a single kernel, you may be able to avoid the global memory roundtrips. – Roger Dahl Jul 18 '12 at 21:39
  • @RogerDahl right, but say you have 2 kernels; then you only need to write and read the data once in global memory, so usually this is negligible compared to the time spent on computation in the kernels. – chaohuang Jul 18 '12 at 21:57
  • Well, it depends. Two kernels may give R P W R P W, while one kernel with two functions may give R P P W. (R and W are global read/write and P is processing). The difference depends on how much time P takes. – Roger Dahl Jul 18 '12 at 22:15
  • Since nearly all the variables between functions are runtime variables, wouldn't I be able to just save the results in the GPU memory, and then just pass them by reference? Thus avoiding any type of slow global variable call? – Shawn Tabrizi Jul 19 '12 at 01:03
  • @user1396977 yes you can pass them by reference, which will be stored in shared memory. – chaohuang Jul 19 '12 at 01:06
  • Also note that global memory accesses are only slow if there are not enough calculations to hide the latency of the L1 and L2 cache misses. If you do a fair amount of processing on the image data and you access the global memory in a way that enables the L1 and L2 caches to do their job, there's a good chance that you won't have to worry about latency. There are excellent profiling tools that will tell you if memory access is slowing down processing. I would first do an "optimistic", simple implementation, and only worry about optimizing memory access later, if it turns out to be necessary. – Roger Dahl Jul 19 '12 at 04:03