1

I'm coding a raytracer using (py)CUDA and I'm getting a really low speedup; for example, for a 1000x1000 image, the GPU-parallelized code is only about 4 times faster than the sequential code executed on the CPU.

For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to generate an image with a width of W pixels and a height of H pixels, the setup is:

  • Grid: W blocks x H blocks.
  • Block: 5 threads.
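
Concretely, for a 1000x1000 image this amounts to the following (a quick sanity check in plain Python; the numbers just restate the setup above):

```python
# Sanity check of the launch configuration described above:
# one block per ray (pixel), 5 threads per block.
W, H = 1000, 1000
threads_per_block = 5

num_blocks = W * H                              # grid of W x H blocks
total_threads = num_blocks * threads_per_block

print(num_blocks)       # 1000000 blocks, one per ray
print(total_threads)    # 5000000 threads overall
```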

The most expensive part of the computation is solving the equations, which I do with a custom Runge–Kutta 4(5) solver.

The code is quite long and hard to explain in such a short question, but you can see it on GitHub. The CUDA kernel is here and the Runge–Kutta solver is here. The CPU version, with a sequential implementation of the exact same solver, can be found in the same repo.

The equations to solve involve several expensive operations, and I suspect that the heavily optimized CPU implementations of functions like sin, cos and sqrt are contributing to the low speedup.

My machine specs are:

  • GPU: GeForce GTX 780
  • CPU: Intel Core i7 CPU 930 @ 2.80GHz

My questions are:

  1. Is it normal to get a speedup of only 3x or 4x in a GPU-parallelized raytracer compared to sequential CPU code?
  2. Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
  3. Am I missing something important?

I understand the question may be too specific, but if you need more information, just ask; I'll be glad to provide it.

talonmies
Alejandro
  • Can you post some code? – syntagma Aug 26 '16 at 17:49
  • The arithmetic throughput for double operations (+,-,*) is only [1/24 of the single precision throughput](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions) on your GTX 780. Furthermore, with blocks of 5 threads, only 5/32 of each warp is used. These two issues immediately reduce the impressive peak 4 TFLOP/s of your GPU to just 26 GFLOP/s. – tera Aug 26 '16 at 18:47
  • Furthermore, using just one warp per block gives you at most 25% occupancy, so latencies can only partially be hidden. – tera Aug 26 '16 at 18:55
  • You are better off combining 32 or 64 pixels into one block, so that a block uses 160 or 320 threads. – tera Aug 26 '16 at 18:56
  • I'm testing other configurations following your advice and my code is now 15x faster than the sequential version: I now have only one thread per ray and I'm arranging the blocks in multiples of 32. I don't know, however, whether keeping the 5-threads-per-ray structure while combining rays into one block as you said would still be better; I'll let you know :) Precision is quite important for me, so switching to single precision is not an option – Alejandro Aug 27 '16 at 10:23
  • Also try 64, 128, and 256 threads per block. Using just 32 still limits your occupancy. Also, if you don't do that already, assign warps to e.g. 8×4 horizontal×vertical pixels instead of 32×1, as the number of Runge-Kutta steps needed might vary less within a warp if pixels are closer together, leading to fewer disabled threads and better resource use. – tera Aug 27 '16 at 11:05
  • Even if you can't just replace every use of double precision with single precision, there might be some places where you can use single precision while staying with double elsewhere. – tera Aug 27 '16 at 11:09
  • Now I have blocks of two warps in which each of them focuses on a 8x4 zone of the image. I'm getting great results: at the very beginning of this question I was obtaining 63 seconds, I'm now at 2.92!! :) I decided then to follow the advice of your last comment, @tera, but changing `sincos` to `sincosf` (see [this question](http://stackoverflow.com/questions/39176708/is-there-any-way-to-optimize-sincos-calls-in-cuda)) buys me just 0.1 seconds; I don't know where else to look for improving the speedup, any more ideas? Thank you again :) – Alejandro Aug 27 '16 at 15:36
  • Just replacing double precision functions with their float equivalent isn't enough. Note that any constant that is not explicitly denoted as float is implicitly of double type, you have a lot of double variables, and any mixed expression will be promoted to double precision. So you need to apply "f" modifiers to double precision constants and change variable types as well. – tera Aug 27 '16 at 17:01
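
(As a sanity check, tera's 26 GFLOP/s figure from the comments above follows from simple arithmetic, assuming a ~4 TFLOP/s single-precision peak for the GTX 780, the 1/24 double-precision ratio, and 5 active lanes per 32-thread warp:)

```python
# Reproducing the back-of-the-envelope estimate from the comments above.
peak_sp_gflops = 4000.0          # approx. single-precision peak of a GTX 780
dp_ratio = 1.0 / 24.0            # double-precision throughput ratio (GeForce GK110)
warp_utilization = 5.0 / 32.0    # 5 active threads out of a 32-thread warp

effective = peak_sp_gflops * dp_ratio * warp_utilization
print(f"{effective:.0f} GFLOP/s")  # -> 26 GFLOP/s
```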

1 Answer

  1. Is it normal to get a speedup of only 3x or 4x in a GPU-parallelized raytracer compared to sequential CPU code?

How long is a piece of string? There is no answer to this question.

  2. Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?

Yes, as noted in comments, you are using a completely inappropriate block size which is wasting approximately 85% of the potential computational capacity of your GPU.

  3. Am I missing something important?

Yes, the answer to this question. Setting correct execution parameters accounts for about 50% of practical performance tuning in CUDA, and you should be able to obtain a noticeable performance improvement just by selecting a sane block size. Beyond that, careful profiling should be your next port of call.
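
To make the suggestion from the comment thread concrete, here is an illustrative index mapping (hypothetical, not taken from the actual kernel) in which each 32-thread warp covers an 8x4 pixel tile, with one thread per ray, and a 64-thread block stacks two tiles to cover an 8x8 region:

```python
# Illustrative thread-to-pixel mapping (not from the actual kernel):
# each 32-thread warp covers an 8x4 pixel tile, and a 64-thread block
# stacks two tiles vertically, covering an 8x8 pixel region.
def thread_to_pixel(block_x, block_y, tid):
    warp, lane = divmod(tid, 32)           # warp index in block, lane in warp
    tile_x, tile_y = lane % 8, lane // 8   # position inside the warp's 8x4 tile
    x = block_x * 8 + tile_x
    y = block_y * 8 + warp * 4 + tile_y
    return x, y

# Each 64-thread block covers a distinct 8x8 region, with no gaps or overlaps:
pixels = {thread_to_pixel(0, 0, t) for t in range(64)}
assert pixels == {(x, y) for x in range(8) for y in range(8)}
```

Keeping a warp's pixels in a compact 2D tile, rather than a 32x1 strip, makes it more likely that neighbouring rays need a similar number of Runge–Kutta steps, reducing divergence within the warp.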

[This answer was assembled from comments and added as a community wiki entry to get this (very broad) question off the unanswered list, in the absence of enough close votes to close it.]

Community
  • Apologies Talonmies for being slow writing an answer - I've been away from my computer for a while. Thanks for filling in the void. – tera Aug 27 '16 at 17:02