
I've just started GPU programming with Numba in the last few days, and I've already learned some techniques from scattered information in blogs, some from the CUDA C Programming Guide, and a lot from the Stack community here.

To simplify: I'm trying to improve the performance of simulations that I previously ran as regular Python code. With Numba, I've already improved my code so that it now runs 45x faster on my GeForce GTX 1660 Ti, but now I'm trying to improve it a little more. As mentioned here, my kernels don't have a good memory access pattern.

Recently I was trying to understand the use of shared memory for improving kernel performance, as in this post, but I don't know whether that example applies to my case: from what I understand, it takes explicit advantage of shared memory, while in my regular kernels I usually do elementwise multiplications, often involving more than one matrix or vector.

Actually, I don't know if this is something I should ask here, so please forgive me if this isn't the right place.

One of the main kernels of my code, together with its test implementation, is in the code below:

from timeit import default_timer as timer
import numba
from numba import jit, guvectorize, int32, int64, float64, prange
from numba import cuda
import numpy as np
from numpy import *
import math

stream = cuda.stream()

D = 9
nx = 20000
ny = 1000
ly = ny-1
uLB = 0.04

cx = np.array([0, 1,-1, 0, 0, 1,-1, 1,-1],dtype=np.float64);
cy = np.array([0, 0, 0, 1,-1, 1,-1,-1, 1],dtype=np.float64);
c = np.array([cx,cy]);
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36],dtype=np.float64);


def inivel(d, x, y):
    return (1-d) * uLB * (1 + 1e-4*sin(y/ly*2*pi))

@cuda.jit
def equilibrium_gpu(rho,u,c,w,feq):
    
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]
    
    cuda.syncthreads()

    j, k = cuda.grid(2)
    if (j < nx2) & (k < ny2):
        for i in range(9):
            feq[i, j, k] = rho[j,k]*w[i] * (1 + (3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k])) + 0.5*(3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k]))**2 - (3/2 * (u[0,j,k]**2 + u[1,j,k]**2)))
            
    cuda.syncthreads()

vel = fromfunction(inivel, (2,nx,ny))
rho = np.ones([nx, ny], dtype='float64')
res = np.zeros([D, nx, ny], dtype='float64')
feq = np.zeros((9,nx,ny))

rho_device = cuda.to_device(rho, stream=stream)
u_device = cuda.to_device(vel, stream=stream)
c_device = cuda.to_device(c, stream=stream)
w_device = cuda.to_device(w, stream=stream)
feq_device = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64, stream=stream)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(nx / threadsperblock[0])
blockspergrid_y = math.ceil(ny / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

s = timer()
cuda.synchronize()
equilibrium_gpu[blockspergrid, threadsperblock,stream](rho_device,u_device,c_device,w_device,feq_device)
cuda.synchronize()
gpu_time = timer() - s

print(gpu_time)

I would like to know how I could improve the performance of this kernel, whether by means of shared memory or in some other way.

Bruno Magacho

2 Answers


You can put w[i] and c[1,i] in shared memory, but I doubt this will be significantly faster, since these global memory accesses should already be cached in the L1. The other accesses look fine.
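For reference, a minimal sketch (untested, and assuming a 2D block with at least 9 threads along x) of staging w and c into shared memory could look like this:

from numba import cuda, float64

@cuda.jit
def equilibrium_gpu_shared(rho, u, c, w, feq):
    # Block-shared copies of the 9 weights and the 2x9 velocity set;
    # the first 9 threads of each block perform the copy.
    w_s = cuda.shared.array(9, dtype=float64)
    c_s = cuda.shared.array((2, 9), dtype=float64)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    if ty == 0 and tx < 9:
        w_s[tx] = w[tx]
        c_s[0, tx] = c[0, tx]
        c_s[1, tx] = c[1, tx]
    cuda.syncthreads()  # here a barrier IS needed: all threads read w_s/c_s

    j, k = cuda.grid(2)
    if j < rho.shape[0] and k < rho.shape[1]:
        for i in range(9):
            cu = 3 * (c_s[0, i] * u[0, j, k] + c_s[1, i] * u[1, j, k])
            feq[i, j, k] = rho[j, k] * w_s[i] * (
                1 + cu + 0.5 * cu * cu - 1.5 * (u[0, j, k]**2 + u[1, j, k]**2))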

The biggest performance issue comes from computing with double-precision numbers on a mainstream GPU that is barely able to compute them efficiently. Indeed, the GeForce GTX 1660 Ti can achieve up to 4.6 TFLOPS with single-precision numbers but only 0.144 TFLOPS with double-precision numbers. Single-precision computation is about 32 times faster on this GPU!

Note that there are GPUs capable of quite efficient double-precision computation, like the Nvidia Volta GPUs, but they are also much more expensive (and double precision is still 2 times slower than single precision even on them).
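As an illustration of the single-precision route (a sketch, not tested against your accuracy requirements): allocate the arrays as float32 and cast the literals inside the kernel so the arithmetic is not silently promoted back to double.

import numpy as np
from numba import cuda, float32

@cuda.jit
def equilibrium_gpu_f32(rho, u, c, w, feq):
    j, k = cuda.grid(2)
    if j < rho.shape[0] and k < rho.shape[1]:
        ux = u[0, j, k]
        uy = u[1, j, k]
        # float32(...) casts keep the whole expression in single precision
        usq = float32(1.5) * (ux * ux + uy * uy)
        for i in range(9):
            cu = float32(3.0) * (c[0, i] * ux + c[1, i] * uy)
            feq[i, j, k] = rho[j, k] * w[i] * (float32(1.0) + cu
                                               + float32(0.5) * cu * cu - usq)

# The host-side arrays would need to be converted accordingly, e.g.:
# rho_device = cuda.to_device(rho.astype(np.float32), stream=stream)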

Besides this, it might be better to unroll the loop yourself, put some variables in registers so they are not recomputed, and prefer a*a over a**2 with regard to the generated code. Moreover, the syncthreads calls seem useless (and are costly). A minimal sketch of these changes follows.
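from numba import cuda

@cuda.jit
def equilibrium_gpu_opt(rho, u, c, w, feq):
    j, k = cuda.grid(2)
    if j < rho.shape[0] and k < rho.shape[1]:
        # Loads hoisted into local variables (registers), the common dot
        # product computed once, a*a instead of a**2, and no syncthreads
        # (there is no shared data for the threads to synchronize on).
        r = rho[j, k]
        ux = u[0, j, k]
        uy = u[1, j, k]
        usq = 1.5 * (ux * ux + uy * uy)
        for i in range(9):
            cu = 3.0 * (c[0, i] * ux + c[1, i] * uy)
            feq[i, j, k] = r * w[i] * (1.0 + cu + 0.5 * cu * cu - usq)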


EDIT: I missed the very inefficient feq[i, j, k] access and the fact that j would be better used for the contiguous dimension, since cuda.grid appears to return the values as x, y and not y, x (where x is supposed to be the "contiguous" dimension, although the documentation is really unclear about this). It is probably much better to write feq (and read the other arrays like u) contiguously using feq[i, k, j] and then transpose the array (possibly on the GPU). I think an optimized full transposition of the arrays would be faster than using shared memory to transpose the blocks locally. Alternatively, you could redesign your code so that no transposition is needed at all, as in the sketch below.
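A sketch of the "no transposition" alternative: keep the (9, nx, ny) layout, but let the first value returned by cuda.grid(2) (which corresponds to threadIdx.x, the fast-varying index within a warp) address the contiguous last dimension, and swap the launch grid to match.

from numba import cuda
import math

@cuda.jit
def equilibrium_gpu_coalesced(rho, u, c, w, feq):
    # k comes from threadIdx.x, so adjacent threads in a warp get adjacent
    # values of k and hence access adjacent (contiguous) memory locations.
    k, j = cuda.grid(2)
    if j < rho.shape[0] and k < rho.shape[1]:
        for i in range(9):
            cu = 3.0 * (c[0, i] * u[0, j, k] + c[1, i] * u[1, j, k])
            feq[i, j, k] = rho[j, k] * w[i] * (
                1.0 + cu + 0.5 * cu * cu
                - 1.5 * (u[0, j, k]**2 + u[1, j, k]**2))

# The grid dimensions must be swapped accordingly:
# blockspergrid = (math.ceil(ny / threadsperblock[0]),
#                  math.ceil(nx / threadsperblock[1]))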

Jérôme Richard
  • Thanks for the clarifying answer @Jérôme Richard. Actually, I was aware that double precision would be more costly, but I didn't realize it would be that much. I will look at the possibilities and impacts of changing my variables to single precision. And thanks for the arithmetic suggestions as well; I will implement those too. – Bruno Magacho Jul 07 '21 at 04:08
  • Note that if you find out that you cannot use single-precision numbers for some reason (e.g. accuracy), it may be better/faster to use a multi-threaded CPU implementation instead of a GPU one. You could use the `parallel` option and the `prange` function of Numba to do that. I also added an important note. – Jérôme Richard Jul 08 '21 at 17:54
  • Thanks for the additional advice; I will take a good look at implementing those changes carefully and see whether they result in a performance gain. I also have a CPU version using `numba`, which already made my code run faster, but even with my limited knowledge of GPU programming I could already write a version that is at least 5 or 6 times faster than the CPU one, still using double precision. I will see how it goes, but again, thanks for the suggestions. – Bruno Magacho Jul 08 '21 at 18:45

This example has most of the modifications suggested in the other answer, except for the switch to single precision (which is fairly easy to do). It seems to run about 3x faster than your kernel on my particular machine (GTX 960). You may wish to change nx and ny to match whatever test case you are running. I've also adjusted the code around the kernel launch to be better for benchmarking, in my opinion. Not all of the changes seem to be meaningful from what I can tell; for example, converting x**2 -> x*x doesn't seem to matter.

from timeit import default_timer as timer
import numba
from numba import jit, guvectorize, int32, int64, float64, prange
from numba import cuda
import numpy as np
from numpy import *
import math

stream = cuda.stream()

D = 9
nx = 4000
ny = 1000
ly = ny-1
uLB = 0.04

cx = np.array([0, 1,-1, 0, 0, 1,-1, 1,-1],dtype=np.float64);
cy = np.array([0, 0, 0, 1,-1, 1,-1,-1, 1],dtype=np.float64);
c = np.array([cx,cy]);
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36],dtype=np.float64);


def inivel(d, x, y):
    return (1-d) * uLB * (1 + 1e-4*sin(y/ly*2*pi))

@numba.jit('void(float64[:,:], float64[:,:], float64[:,:], float64[:], float64[:], float64[:], float64[:,:,:])',parallel=True)
def equilibrium_cpu(rho, ux, uy, cx, cy, w, feq ):              # Equilibrium distribution function.
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    for i in prange(9):
        for j in prange(0,nx2):
            for k in prange(0,ny2):
                feq[i,j,k] = rho[j,k]*w[i] * (1 + (3 * (cx[i]*ux[j,k] + cy[i]*uy[j,k])) + 0.5*(3 * (cx[i]*ux[j,k] + cy[i]*uy[j,k]))**2 - (3/2 * (ux[j,k]**2 + uy[j,k]**2)))



@cuda.jit
def equilibrium_gpu(rho,u,c,w,feq):

    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    k,j = cuda.grid(2)
    if (j < nx2) & (k < ny2):
        r = rho[j,k]
        ux = u[0,j,k]
        uy = u[1,j,k]
        for i in range(9):
             cv = (3 * (c[0,i]*ux + c[1,i]*uy))
             feq[i, j, k] = r*w[i] * (1 + cv + 0.5*(cv*cv) - (3/2 * (ux*ux + uy*uy)))

vel = fromfunction(inivel, (2,nx,ny))
rho = np.ones([nx, ny], dtype='float64')
res = np.zeros([D, nx, ny], dtype='float64')
feq = np.zeros((9,nx,ny))

rho_device = cuda.to_device(rho, stream=stream)
u_device = cuda.to_device(vel, stream=stream)
c_device = cuda.to_device(c, stream=stream)
w_device = cuda.to_device(w, stream=stream)
feq_device = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64, stream=stream)

threadsperblock = (32, 32)
blockspergrid_x = math.ceil(ny / threadsperblock[0])
blockspergrid_y = math.ceil(nx / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

s = timer()
equilibrium_cpu(rho, vel[0], vel[1], cx, cy, w, feq)

cpu_time = timer() - s

print(cpu_time)
equilibrium_gpu[blockspergrid, threadsperblock,stream](rho_device,u_device,c_device,w_device,feq_device)


cuda.synchronize()
s = timer()
equilibrium_gpu[blockspergrid, threadsperblock,stream](rho_device,u_device,c_device,w_device,feq_device)
cuda.synchronize()
gpu_time = timer() - s

print(gpu_time)

s = timer()
feq_host2 = feq_device.copy_to_host()
print(timer()-s)

gain = cpu_time/gpu_time
print(gain)
print(np.allclose(feq_host2, feq))

Robert Crovella