I've just started GPU programming with Numba in the last few days, and I've already learned some techniques from information scattered across blogs, from the CUDA C Programming Guide, and, to a large extent, from the Stack Overflow community.
To put it simply, I'm trying to improve the performance of simulations that I previously ran as plain Python code. With Numba, my code already runs about 45x faster on my GeForce GTX 1660 Ti, but now I'm trying to improve it a bit further because, as mentioned here, my kernels don't have a good memory access pattern.
Recently I've been trying to understand how shared memory can be used to improve kernel performance, as in this post, but I don't know whether that example helps me, because from what I understand it takes explicit advantage of data reuse in shared memory, whereas my kernels usually just do element-wise operations, often involving more than one matrix or vector.
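For reference, this is roughly the pattern I understand from that post, reduced to a minimal sketch (the kernel name, array names and block-level staging are just placeholders I made up): each block copies a small, frequently reused array into shared memory once, and every thread then reads it from there instead of from global memory.

import numpy as np
from numba import cuda, float64

@cuda.jit
def weighted_copy_shared(a, w, out):
    # One copy of the 9 weights per block, loaded once and reused by all threads.
    w_sh = cuda.shared.array(9, dtype=float64)

    j, k = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    # Let the first 9 threads of each block fill the shared buffer.
    t = ty * cuda.blockDim.x + tx
    if t < 9:
        w_sh[t] = w[t]
    cuda.syncthreads()

    if j < a.shape[0] and k < a.shape[1]:
        for i in range(9):
            out[i, j, k] = w_sh[i] * a[j, k]

I'm not sure this kind of reuse really applies to my case, though, since the arrays that are shared across threads (w and c below) are tiny.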
I'm not sure whether this is the right place to ask, so please forgive me if it isn't.
One of the main kernels in my code, along with a test implementation, is shown in the code below:
from timeit import default_timer as timer
import numba
from numba import jit, guvectorize, int32, int64, float64, prange
from numba import cuda
import numpy as np
from numpy import *
import math
stream = cuda.stream()
D = 9
nx = 20000
ny = 1000
ly = ny-1
uLB = 0.04
cx = np.array([0, 1, -1, 0, 0, 1, -1, 1, -1], dtype=np.float64)
cy = np.array([0, 0, 0, 1, -1, 1, -1, -1, 1], dtype=np.float64)
c = np.array([cx, cy])
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36], dtype=np.float64)
def inivel(d, x, y):
    return (1-d) * uLB * (1 + 1e-4*sin(y/ly*2*pi))
@cuda.jit
def equilibrium_gpu(rho, u, c, w, feq):
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    cuda.syncthreads()

    j, k = cuda.grid(2)

    if (j < nx2) & (k < ny2):
        for i in range(9):
            feq[i, j, k] = rho[j, k]*w[i] * (1 + (3 * (c[0, i]*u[0, j, k] + c[1, i]*u[1, j, k])) + 0.5*(3 * (c[0, i]*u[0, j, k] + c[1, i]*u[1, j, k]))**2 - (3/2 * (u[0, j, k]**2 + u[1, j, k]**2)))

    cuda.syncthreads()
vel = fromfunction(inivel, (2,nx,ny))
rho = np.ones([nx, ny], dtype='float64')
res = np.zeros([D, nx, ny], dtype='float64')
feq = np.zeros((9,nx,ny))
rho_device = cuda.to_device(rho, stream=stream)
u_device = cuda.to_device(vel, stream=stream)
c_device = cuda.to_device(c, stream=stream)
w_device = cuda.to_device(w, stream=stream)
feq_device = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64, stream=stream)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(nx / threadsperblock[0])
blockspergrid_y = math.ceil(ny / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
s = timer()
cuda.synchronize()
equilibrium_gpu[blockspergrid, threadsperblock,stream](rho_device,u_device,c_device,w_device,feq_device)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
I would like to know how I could improve the performance of this kernel, whether by means of shared memory or in some other way.
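For what it's worth, one variant I've considered (I'm not sure it's the right approach, and the name equilibrium_gpu_swapped is just mine) is swapping the index mapping so that consecutive threads in a warp access consecutive elements along the last, contiguous axis of the arrays, roughly like this:

@cuda.jit
def equilibrium_gpu_swapped(rho, u, c, w, feq):
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    # Map the fast-varying thread index (x) to the last, contiguous array axis (k),
    # so neighbouring threads touch neighbouring memory locations.
    k, j = cuda.grid(2)

    if (j < nx2) and (k < ny2):
        for i in range(9):
            cu = 3 * (c[0, i]*u[0, j, k] + c[1, i]*u[1, j, k])
            usq = u[0, j, k]**2 + u[1, j, k]**2
            feq[i, j, k] = rho[j, k] * w[i] * (1 + cu + 0.5*cu**2 - 1.5*usq)

# The launch configuration would have to be swapped accordingly, e.g.:
# blockspergrid = (math.ceil(ny / threadsperblock[0]), math.ceil(nx / threadsperblock[1]))

I don't know whether this is actually better than (or compatible with) using shared memory, so any guidance is appreciated.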