
I am really interested in understanding how the GPU parallelizes different tasks such as real-time rendering and training neural networks. I know the math behind parallelization, but I am curious how the GPU actually works. Real-time rendering and training neural networks are really different. How does the GPU parallelize these two tasks efficiently?

bitWise
  • Not enough for a complete answer: GPU rendering operations involve a lot of linear algebra, for instance. Training AI models as well. The GPU is optimized to handle parallelized linear algebra operations. – Pac0 May 07 '20 at 06:56
  • Note that searching for the title of your question on Google leads to several interesting results: https://research.nvidia.com/sites/default/files/pubs/2007-02_How-GPUs-Work/04085637.pdf, http://renderingpipeline.com/2012/11/understanding-the-parallelism-of-gpus/ etc. – Pac0 May 07 '20 at 06:57

1 Answer


GPU parallelization requires the problem to be split up into as many independent, equal computations as possible (SIMD). What in C++ looks like

void example(float* data, const int N) {
    for(int n=0; n<N; n++) { // serial loop: one element after another
        data[n] += 1.0f;
    }
}

in OpenCL C looks like this:

kernel void example(global float* data) {
    const int n = get_global_id(0); // each work item (GPU thread) handles exactly one element
    data[n] += 1.0f;
}

A few examples:

For real-time rendering, a tessellated surface can be rendered by the GPU by drawing every triangle using a separate GPU core. https://youtu.be/1ww8qRCMc4s
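On most GPUs the actual rasterization runs through the graphics pipeline, but the per-triangle parallelism can be loosely illustrated with a compute-style sketch (hypothetical kernel and buffer layout), where each work item transforms the three vertices of one triangle by a 4x4 matrix:

kernel void transform_triangles(global const float4* vertices_in, global float4* vertices_out, constant float* M) {
    const int t = get_global_id(0); // one work item per triangle
    for(int v=0; v<3; v++) { // the 3 vertices of this triangle
        const float4 p = vertices_in[3*t+v];
        vertices_out[3*t+v] = (float4)( // multiply by the 4x4 transformation matrix M (row-major)
            M[ 0]*p.x + M[ 1]*p.y + M[ 2]*p.z + M[ 3]*p.w,
            M[ 4]*p.x + M[ 5]*p.y + M[ 6]*p.z + M[ 7]*p.w,
            M[ 8]*p.x + M[ 9]*p.y + M[10]*p.z + M[11]*p.w,
            M[12]*p.x + M[13]*p.y + M[14]*p.z + M[15]*p.w);
    }
}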

Neural networks come down to large matrix multiplications, and within a matrix multiplication, individual columns or tiles of the result can be computed independently and in parallel. Vector additions, for example, are parallelized into as many threads as there are vector components, and each GPU core computes only a single vector component.
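As a minimal sketch (not taken from any deep-learning framework), a vector addition and a naive matrix multiplication in OpenCL C could look like this, with one work item per output element:

kernel void vector_add(global const float* a, global const float* b, global float* c) {
    const int n = get_global_id(0); // one work item per vector component
    c[n] = a[n] + b[n];
}

kernel void matrix_multiply(global const float* A, global const float* B, global float* C, const int N) {
    const int i = get_global_id(0); // row of the output element
    const int j = get_global_id(1); // column of the output element
    float sum = 0.0f;
    for(int k=0; k<N; k++) { // dot product of row i of A and column j of B
        sum += A[i*N+k]*B[k*N+j];
    }
    C[i*N+j] = sum; // C = A*B for square NxN matrices
}

Production libraries additionally use tiling in local memory and vectorization to reduce memory traffic, but the parallelization pattern stays the same: every output element is an independent computation.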

Lattice-based fluid simulations such as LBM work on a 3D lattice of, let's say, 256x256x256 lattice points. For each of these 16777216 lattice points the computations are the same, and they can be done concurrently because they are independent of each other. So the simulation is split up into 16777216 threads on the GPU, one for every lattice point. If the GPU has 4096 cores, it can compute 4096 of these concurrently. As you can imagine, this is orders of magnitude faster than running such tasks on CPUs. https://youtu.be/a1u2g9ahIDk
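A full LBM collision-streaming step is too long to show here, but the parallelization pattern can be sketched with a toy stencil update (hypothetical kernel, hard-coded 256x256x256 grid) where each work item updates one lattice point:

kernel void lattice_step(global const float* src, global float* dst) {
    const int x = get_global_id(0), y = get_global_id(1), z = get_global_id(2);
    const int sx = 256, sy = 256;    // lattice dimensions in x and y
    const int n = x + sx*(y + sy*z); // linear index of this lattice point
    const int xp = (x+1)%sx;         // +x neighbor with periodic boundary
    const int np = xp + sx*(y + sy*z);
    dst[n] = 0.5f*(src[n] + src[np]); // toy update: each point only reads its neighbor and writes itself
}

Launched as a 3D range of 256x256x256 work items, this creates the 16777216 threads mentioned above.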

A particle simulation can compute each particle on a separate GPU core. This works as long as the particles are mostly independent. https://youtu.be/8Szib8Km5Mo
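A sketch of such a per-particle kernel (a minimal example with constant gravity and explicit Euler integration, not the simulation from the linked video) could look like this:

kernel void integrate_particles(global float* x, global float* y, global float* z,
                                global float* vx, global float* vy, global float* vz, const float dt) {
    const int n = get_global_id(0); // one work item per particle
    vz[n] -= 9.81f*dt; // gravity changes the velocity
    x[n] += vx[n]*dt;  // the velocity changes the position
    y[n] += vy[n]*dt;
    z[n] += vz[n]*dt;
}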

For good saturation, that is, to reach maximum efficiency, the number of threads should be much larger than the number of GPU cores available. Branching also takes a performance hit: within a group of 32 GPU cores, if one core takes the true branch and all the others take the false branch, both branches have to be computed by all cores within the group. In the tessellated-surface rendering example, if the triangles have vastly different sizes, performance takes a hit for a similar reason: the entire group has to wait for the one GPU core with the largest triangle to finish. If all triangles are approximately the same size, however, performance is very good.
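To illustrate the branching penalty, consider this hypothetical kernel: only one work item in 32 takes the expensive branch, but every group of 32 that contains such a work item effectively executes both branches, with the inactive cores masked off:

float expensive_path(float v) { // toy stand-in for a long computation
    for(int i=0; i<1000; i++) v = sin(v)+1.0f;
    return v;
}

kernel void divergent_example(global float* data) {
    const int n = get_global_id(0);
    if(n%32==0) {
        data[n] = expensive_path(data[n]); // only 1 in 32 work items takes this branch ...
    } else {
        data[n] += 1.0f; // ... but the whole group of 32 has to wait until it has finished
    }
}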

ProjectPhysX