Are we saying that a task that requires fairly light computation per row on a huge number of rows is fundamentally unsuited to a GPU?
I have some data processing to do on a table whose rows are independent, so the job is embarrassingly parallel. I have a GPU, so... a match made in heaven? The work is quite similar to this example, which calculates a moving average for each entry in every row (rows are independent):
import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    # Build up the sum over the first (partial) window.
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    # Slide the full window across the rest of the row.
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))
Like this example, my processing for each row is not heavily mathematical. Rather, it loops across the row doing some sums and assignments and other bits and pieces, with some conditional logic thrown in.
I have tried using guvectorize in the Numba library to push this onto an Nvidia GPU. It works fine, but I'm not getting a speedup.
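For reference, this is roughly what I did to move it onto the GPU: the same function compiled with target='cuda'. (This is only a sketch; it assumes a CUDA-capable card and a working CUDA toolkit are visible to Numba, and the exact type strings may need tweaking for the CUDA target.)

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], int64[:], float64[:])'],
             '(n),()->(n)', target='cuda')
def move_mean_gpu(a, window_arr, out):
    # Same body as the CPU version above; Numba compiles it for the device.
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(move_mean_gpu(arr, 3))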
Is this type of task suited to a GPU in principle? That is, if I go deeper into Numba and start tweaking the threads, blocks and memory management, or the algorithm implementation, should I, in theory, get a speedup? Or is this kind of problem fundamentally unsuited to the architecture?
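For concreteness, this is the kind of hand-written kernel I have in mind when I say "going deeper", with an explicit thread/block configuration (again just a sketch; the one-thread-per-row mapping and the block size of 128 are arbitrary choices on my part):

import numpy as np
from numba import cuda

@cuda.jit
def move_mean_kernel(a, window_width, out):
    # One thread per row; each thread walks its row sequentially.
    row = cuda.grid(1)
    if row < a.shape[0]:
        asum = 0.0
        count = 0
        for i in range(window_width):
            asum += a[row, i]
            count += 1
            out[row, i] = asum / count
        for i in range(window_width, a.shape[1]):
            asum += a[row, i] - a[row, i - window_width]
            out[row, i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
out = np.empty_like(arr)
threads_per_block = 128
blocks = (arr.shape[0] + threads_per_block - 1) // threads_per_block
move_mean_kernel[blocks, threads_per_block](arr, 3, out)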
The answers to the questions below seem to suggest it is unsuited, but I am not quite convinced yet:
numba - guvectorize barely faster than jit
And numba guvectorize target='parallel' slower than target='cpu'