I'm coding a raytracer using (py)CUDA and I'm obtaining a really low speedup; for example, for a 1000x1000 image, the GPU-parallelized code is only about 4 times faster than the sequential code executed on the CPU.
For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, to generate an image W pixels wide and H pixels high, the setup is (sketched in code below):

- Grid: W x H blocks.
- Block: 5 threads.
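In pyCUDA terms, the launch looks roughly like this (the kernel body here is just a placeholder with an illustrative name; the real kernel is in the repo):

```python
import numpy as np
import pycuda.autoinit          # creates a context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Placeholder kernel, only to show the launch configuration.
mod = SourceModule("""
__global__ void rayTrace(float *image, int w)
{
    int x  = blockIdx.x, y = blockIdx.y;   // one block per ray/pixel
    int eq = threadIdx.x;                  // 5 threads: one per equation
    if (eq == 0)
        image[y * w + x] = 0.0f;           // placeholder work
}
""")
kernel = mod.get_function("rayTrace")

W, H = 1000, 1000
image = np.zeros((H, W), dtype=np.float32)
kernel(drv.Out(image), np.int32(W),
       block=(5, 1, 1),    # 5 threads per block
       grid=(W, H))        # W x H blocks, one per ray
```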
The most expensive part is solving the equations, which I do with a custom Runge-Kutta 4(5) algorithm.
The code is quite long and hard to explain in such a short question, but you can see it on GitHub. The CUDA kernel is here and the Runge-Kutta solver is here. The CPU version, with a sequential implementation of the exact same solver, can be found in the same repo.
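To make the structure clearer without pasting the whole thing, this is roughly how the 5 threads in a block cooperate on one integration step. I show a plain RK4 step for brevity (my real solver is the adaptive RK4(5)), and `rhs()` is a placeholder, not the paper's actual equations:

```python
import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
// Placeholder right-hand side, with the same kind of transcendental
// calls the real equations make.
__device__ float rhs(int i, const float *y, float t)
{
    return sinf(y[i]) + sqrtf(fabsf(y[(i + 1) % 5])) * cosf(t);
}

__global__ void rk4Step(float *state, float t, float h)
{
    __shared__ float y[5], ytmp[5];
    __shared__ float k1[5], k2[5], k3[5], k4[5];
    int i   = threadIdx.x;                      // equation index, 0..4
    int ray = blockIdx.y * gridDim.x + blockIdx.x;

    // Each thread owns one component; shared memory lets every thread
    // read the full state when evaluating its right-hand side.
    y[i] = state[ray * 5 + i];
    __syncthreads();

    k1[i] = rhs(i, y, t);
    __syncthreads();
    ytmp[i] = y[i] + 0.5f * h * k1[i];
    __syncthreads();

    k2[i] = rhs(i, ytmp, t + 0.5f * h);
    __syncthreads();
    ytmp[i] = y[i] + 0.5f * h * k2[i];
    __syncthreads();

    k3[i] = rhs(i, ytmp, t + 0.5f * h);
    __syncthreads();
    ytmp[i] = y[i] + h * k3[i];
    __syncthreads();

    k4[i] = rhs(i, ytmp, t + h);
    __syncthreads();

    state[ray * 5 + i] =
        y[i] + h / 6.0f * (k1[i] + 2.0f * k2[i] + 2.0f * k3[i] + k4[i]);
}
""")
rk4_step = mod.get_function("rk4Step")
```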
The equations to solve involve several heavy computations, and I'm afraid the highly optimized CPU implementations of functions like `sin`, `cos` and `sqrt` could be part of the reason the speedup is so low(?)
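In case it matters, on the GPU side pyCUDA lets me pass nvcc flags such as `--use_fast_math` when compiling the kernel, so the transcendental calls map to the fast hardware intrinsics. This is an illustrative snippet, not code from my repo:

```python
import pycuda.autoinit
from pycuda.compiler import SourceModule

# Toy kernel with the same kind of transcendental calls, compiled
# with and without nvcc's fast-math flag.
src = """
__global__ void toy(float *out)
{
    float x = out[threadIdx.x];
    out[threadIdx.x] = sinf(x) + cosf(x) + sqrtf(fabsf(x));
}
"""
mod_precise = SourceModule(src)
mod_fast    = SourceModule(src, options=["--use_fast_math"])
```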
My machine specs are:
- GPU: GeForce GTX 780
- CPU: Intel Core i7 CPU 930 @ 2.80GHz
My questions are:
- Is it normal to get a speedup of only 3x or 4x in a GPU-parallelized raytracer compared to the sequential CPU code?
- Do you see anything wrong in the CUDA setup or in the code that could be causing this behaviour?
- Am I missing something important?
I understand the question may be too specific, but if you need more information, just ask and I'll be glad to provide it.