
Are we saying that a task that requires fairly light computation per row on a huge number of rows is fundamentally unsuited to a GPU?

I have some data processing to do on a table where the rows are independent, so it is embarrassingly parallel. I have a GPU, so... a match made in heaven? The task is quite similar to this example, which calculates a moving average for each entry per row (rows are independent).

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    # Moving average along one row; the window width arrives as a 1-element array.
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    # Ramp-up: average over however many elements are available so far.
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    # Steady state: slide the window by adding the newest element and dropping the oldest.
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))

Like this example, my processing for each row is not heavily mathematical. Rather, it is looping across the row and doing some sums, assignments and other bits and pieces, with some conditional logic thrown in.

I have tried using guvectorize in the Numba library to assign this to an Nvidia GPU. It works fine, but I'm not getting a speedup.
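Roughly, what I tried looks like the following sketch (illustrative only, not my exact code: it uses Numba's target='cuda' option for guvectorize, assumes a working CUDA toolkit and GPU, and uses int64 for the window argument just to keep the signature explicit):

from numba import guvectorize
import numpy as np

# Same moving-average kernel, but compiled as a CUDA gufunc (sketch).
@guvectorize(['void(float64[:], int64[:], float64[:])'], '(n),()->(n)',
             target='cuda')
def move_mean_gpu(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, a.shape[0]):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(move_mean_gpu(arr, np.int64(3)))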

Is this type of task suited to a GPU in principle? I.e. if I go deeper into Numba and start tweaking the threads, blocks and memory management, or the algorithm implementation, should I, in theory, get a speedup? Or is this kind of problem fundamentally just unsuited to the architecture?

The answers to the questions below seem to suggest it is unsuited, but I am not quite convinced yet.

numba - guvectorize barely faster than jit

And numba guvectorize target='parallel' slower than target='cpu'

John Curry
  • If your data isn't already on the GPU, you won't get any speedup from the GPU, because CPU<->GPU data transfers are quite costly – ead Apr 27 '18 at 04:58

1 Answer


Your task is obviously memory-bound, but that doesn't mean you cannot profit from a GPU; it is just probably less straightforward than for a compute-bound task.

Let's look at a common configuration and do some math:

  1. CPU-RAM memory bandwidth of ca. 24GB/s
  2. CPU-GPU transfer bandwidth of ca. 8GB/s
  3. GPU-RAM memory bandwidth of ca. 180GB/s

Let's assume we need to transfer 24 GB of data to complete the task, so we will have the following optimal times (whether and how to achieve these times is another question!):

  1. CPU only: time = 24 GB / 24 GB/s = 1 second.
  2. The data must be transferred from CPU to GPU (24 GB / 8 GB/s = 3 seconds) and processed there (24 GB / 180 GB/s ≈ 0.13 seconds), which gives about 3.1 seconds in total.
  3. The data is already on the device, so only 24 GB / 180 GB/s ≈ 0.13 seconds are needed.

As one can see, there is a potential for a speed-up, but only in the third scenario: when your data is already on the GPU device.
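Spelled out as a throwaway calculation (the bandwidth figures are the rough assumptions from above, not measured values):

# Back-of-the-envelope timings for the three scenarios (assumed bandwidths).
data_gb = 24.0        # data volume to stream, in GB
cpu_ram_bw = 24.0     # GB/s, CPU <-> RAM
pcie_bw = 8.0         # GB/s, CPU <-> GPU transfer
gpu_ram_bw = 180.0    # GB/s, GPU <-> device memory

t_cpu_only = data_gb / cpu_ram_bw                              # ~1.0 s
t_transfer_and_gpu = data_gb / pcie_bw + data_gb / gpu_ram_bw  # ~3.13 s
t_gpu_resident = data_gb / gpu_ram_bw                          # ~0.13 s

print(t_cpu_only, t_transfer_and_gpu, t_gpu_resident)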

However, achieving the maximal bandwidth is quite a challenging enterprise.

For example, when processing the matrix row-wise on the CPU, you would like your data to be in row-major order (C order) in order to get the most out of the L1 cache: when reading one double you actually get a whole cache line of 8 doubles loaded into the cache, and you don't want them to be evicted before you have processed the remaining 7.

On the GPU, on the other hand, you want the memory accesses to be coalesced, i.e. thread 0 should access address 0, thread 1 address 1, and so on. For this to work (with each thread handling one row), the data must be in column-major order (Fortran order).
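A sketch of what that reordering could look like on the host side (np.asfortranarray makes a column-major copy; whether it pays off depends on how the kernel actually indexes the data):

import numpy as np

arr_c = np.arange(2000000, dtype=np.float64).reshape(200000, 10)  # C order: elements of a row are adjacent
arr_f = np.asfortranarray(arr_c)                                  # Fortran order: elements of a column are adjacent

# Same values, different memory layout:
print(arr_c.flags['C_CONTIGUOUS'], arr_f.flags['F_CONTIGUOUS'])   # True True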


There is another thing to consider: the way you test the performance. Your test array is only about 16 MB large (2,000,000 doubles) and thus small enough to fit into the L3 cache of many CPUs. The bandwidth of the L3 cache depends on the number of cores used for the calculation, but will be at least around 100 GB/s - not much slower than the GPU, and probably much faster when parallelized on the CPU.

You need a bigger dataset to avoid being fooled by cache effects.
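For example, something along these lines (a rough sketch: the array is sized well beyond typical L3 capacities, and a small warm-up call keeps the JIT compilation out of the measurement):

import time
import numpy as np

big = np.random.rand(2000000, 100)   # ~1.6 GB, far larger than any L3 cache
move_mean(big[:100], 3)              # warm-up: compile before timing

start = time.perf_counter()
res = move_mean(big, 3)
print('seconds:', time.perf_counter() - start)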


A somewhat off-topic remark: your algorithm is not very robust from a numerical point of view.

Suppose the window width were 3, as in your example, but there were about 10**4 elements in a row. Then the value for the last element is the result of about 10**4 additions and subtractions, each of which adds a rounding error to the value - compared to only three additions if that last window were computed "naively". That is quite a difference.

Of course, it might not be of significance (for 10 elements in a row, as in your example), but it might also bite you one day...
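A small sketch of the effect (float32 and a long row are used only to make the drift show up sooner; with float64 the error is smaller but accumulates in the same way):

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10**6).astype(np.float32)   # one long "row"
w = 3

# Running-sum update, as in move_mean: one add and one subtract per element.
asum = np.float32(a[:w].sum())
for i in range(w, len(a)):
    asum += a[i] - a[i - w]
running_last = asum / np.float32(w)

# Reference: recompute only the last window from scratch in double precision.
reference_last = a[-w:].astype(np.float64).sum() / w

print(running_last, reference_last, abs(float(running_last) - reference_last))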

ead
  • It makes sense to me. What if there are many more sequential instructions to carry out per row, i.e. nothing mathematically complex, just a lot more to do with each row? Would that in any way start to utilize the processing power of the GPU more efficiently? – John Curry Apr 29 '18 at 15:18
  • As long as your task is memory-bound, the ratio of GPU to CPU memory bandwidth is the speedup you can hope for. It is only worth transferring the data to the GPU if it must be read multiple times during the algorithm's execution. – ead Apr 29 '18 at 16:38
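For reference, a sketch of keeping the data resident on the device with Numba so the PCIe transfer is paid only once (device arrays created with cuda.to_device can then be handed to the GPU code repeatedly):

import numpy as np
from numba import cuda

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)

d_arr = cuda.to_device(arr)        # pay the host -> device transfer once
# ... run the GPU kernel(s) / gufunc(s) against d_arr as many times as needed ...
result = d_arr.copy_to_host()      # copy back only when the final result is required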