Optimizing an OpenCL Kernel

Question

I'm trying to optimize the following Kernel using OpenCL. Basically I'm doing triangle rasterization using a frame buffer of 800*800 pixels, and checking the bounding box of each triangle and if the current pixel that is scanned line is inside the box of the triangle then I rasterize it.

here is the kernel and how do I invoke it:

  global_size[0] = 800;
  global_size[1] = 800;
  

  auto time_start = std::chrono::high_resolution_clock::now();
  CL_CHECK(clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL));
  CL_CHECK(clFinish(commandQueue));

And the kernel:

size_t px = get_global_id(0); // triCount
size_t py = get_global_id(1); // triCount
int width = 800;
int height = 800;

float3 v0Raster = (float3)(triangles[px].v[0].pos[0], triangles[px].v[0].pos[1], triangles[px].v[0].pos[2]);
float3 v1Raster = (float3)(triangles[px].v[1].pos[0], triangles[px].v[1].pos[1], triangles[px].v[1].pos[2]);
float3 v2Raster = (float3)(triangles[px].v[2].pos[0], triangles[px].v[2].pos[1], triangles[px].v[2].pos[2]);
float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);
float slope = (ymax - ymin) / (xmax - xmin);
int dp, y;
bool discard_;
float ratio;
for (int x = round(xmin); x <= round(xmax); x++) {
    y = slope * (x - round(xmin) + ymin);
    ratio = (x - round(xmin) / (round(xmax) - round(xmin)));
    discard_ = false;

    int flatIdx = width - x + (height - y) * width;
    if (y < 0 || y > height || x < 0 || x > width) {
        discard_ = true;
    }
    if (!discard_) {


        fragments[flatIdx].col[0] = 1.0f;
        fragments[flatIdx].col[1] = 0;
        fragments[flatIdx].col[2] = 0;

    }
}

For a cube it takes one second to render it, which is so slow..

I do not understand the role of `index`, it seems to be calculated based on image dimensions yet you use it to index triangles? So are you launching a thread per pixel as you describe? If so, rasterizing the full triangle in each thread is really wasteful (and contradicts your description) and will likely cause bad memory access even if you have launched per-triangle threads. Can you please add a short pseudocode? I believe a better algorithm would be per-pixel thread which finds the closest ray-triangle collision. — Quimby, Sep 17 '22 at 08:26
@Quimby, I'm doing simple triangle rasterization, I don't know If I'm doing it correctly or no, but you are right, I'm indexing per triangle, however I'm rasterizing per fragment.. Do you know a good answer to solve that problem ? also when making the global size larger, it's much slower, and also can't adjust it per triangle, only per pixel — andre_lamothe, Sep 17 '22 at 10:26

Quimby · Answer 1 · 2022-09-17T12:29:16.377

I see couple of weird things in the kernel.

You are launching a thread for each pixel but in each thread you pick a single triangle (confusingly based on pixel coordinates) and rasterize it whole in the two nested for-loops.

That results in many idle threads if number of triangles is less than the pixels. Or if there are more, some will not be used. That is an error surely.
In each "pixel thread", you rasterize a whole triangle, again that seems like an error.
Access to dest_buffer is thus very contested and will contain many overwrites if triangles overlap and the memory access will be essentially random, that's the worst case scenario.

Even if you implemented per-triangle kernel correctly, it's not a good idea since the runtime of the thread will correspond to size of the triangle. That is bad since it will unevenly distribute work, furthermore the memory access will be terrible.

Much better and simpler solution IMHO is:

Launch thread per pixel/fragment.
In each thread go through all triangles in a for loop in the same order
Test whether the current pixel is in the triangle, if so compute its depth and color.
At the end, color the pixel based on the closest triangle found. That is easily done with a local variable which keeps min_depth and matching min_color.

This is ideal for GPU because:

The job is parallelized over large amount of threads dues to many pixels,
Each thread executes exactly the same instructions, linear in number of triangles.
(3) can be implemented very easily as point-triangle test, see this answer which can be basically copied to the kernel code. It optionally works for 3D by applying the operations also to .z component.
threads in a local group will process neighboring pixels and thus likely the same triangles and hitting the same branches. Although it might be possible to implement this in branchless way.

__kernel void fragment_shader(__global float3* fragments, __global struct Triangle_* triangles, int triCount)
{
    size_t px = get_global_id(0);
    size_t py = get_global_id(1);

    for (int index = 0; index < triCount; index++)
    { 
        float3 v0Raster = (float3)(triangles[index].v[0].pos[0], triangles[index].v[0].pos[1], triangles[index].v[0].pos[2]);
        float3 v1Raster = (float3)(triangles[index].v[1].pos[0], triangles[index].v[1].pos[1], triangles[index].v[1].pos[2]);
        float3 v2Raster = (float3)(triangles[index].v[2].pos[0], triangles[index].v[2].pos[1], triangles[index].v[2].pos[2]);
        float3 p = (float3)((float)px,(float)py, 0.0f);
        if (PointInTriangle(p, v0Raster, v1Raster, v2Raster))
        {
             fragments[px+get_global_size(0)*py]=triangles[index].color;
        }
    }
}

and launch it with the code you have - global size matching the fragment buffer, local size is irrelevant.

I would recommend first dropping all the lighting stuff, just focus on minimal working example e.g. for colors. Make that fast, only then start adding more stuff.

Do note the code above will not render the triangles correctly depth-wise but instead set the fragment to the last triangle regardless of its depth.

Many thanks, I'm still confused about your approach, can you please edit your answer, to show how would I invoke the kernel for global size,..etc, and how in pseudo code if it's possible inside the kernel what should be there ? — andre_lamothe, Sep 17 '22 at 11:24
I made a for loop for each triangle, do the test, and launched a kernel with framebuffer size as global, now it's 1 FPS, very slow — andre_lamothe, Sep 17 '22 at 11:30
I edited the post with the current kernel, with your suggestion, a for loop for each triangle — andre_lamothe, Sep 17 '22 at 11:31
@AhmedSaleh Sorry, I do not have OpenCL setup on my machine right now so I cannot easily test it. I think the current kernel is OK apart from some unused variables. I do not get why is there `if` about offscreen triangle. I am unsure about the validity of indexing since you are not using local work size. I've added a small pseudo code to my answer. BTW how many triangles do you have and on what machine are you running the code? — Quimby, Sep 17 '22 at 12:29
Maybe do another test, drop the whole code, keep only `fragments[px+get_global_size(0)*py]=(float3)(1.0f,0.0f,0.0f);` (+ the `px,py` initialization) and see how long that takes to color the screen red. — Quimby, Sep 17 '22 at 12:31
I used the pseudo code for rendering a cow model with 5k vertices, now it's 15FPS, before it was 144 FPS.. Can we discuss in chat ?, I'm testing on nVidia RTX 3080 — andre_lamothe, Sep 17 '22 at 12:34
@AhmedSaleh Sorry no, I don't have much time :( Sorry for not being able to help. — Quimby, Sep 17 '22 at 12:37
No worries, but still it's slower than the previous approach :( — andre_lamothe, Sep 17 '22 at 12:37
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248133/discussion-between-ahmed-saleh-and-quimby). — andre_lamothe, Sep 17 '22 at 12:38
Why would you guess, the current approach is slower.. I mean iterating over 5k triangles is bad ? but in previous approach it was wrong, but much faster.. — andre_lamothe, Sep 17 '22 at 12:43

Optimizing an OpenCL Kernel

1 Answers1