I am using numba cuda to calculate a function.
The code is simply to add up all the values into one result, but numba cuda gives me a different result from numpy.
numba code
import math
def numba_example(number_of_maximum_loop,gs,ts,bs):
from numba import cuda
result = cuda.device_array([3,])
@cuda.jit(device=True)
def BesselJ0(x):
return math.sqrt(2/math.pi/x)
@cuda.jit
def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
i = cuda.grid(1)
if i < number_of_maximum_loop:
result[0] += BesselJ0(i/100+gs)
result[1] += BesselJ0(i/100+ts)
result[2] += BesselJ0(i/100+bs)
# Configure the blocks
threadsperblock = 128
blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock
# Start the kernel
cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)
return result.copy_to_host()
numba_example(1000,20,20,20)
output:
array([ 0.17770302, 0.34166728, 0.35132036])
numpy code
import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
import numpy as np
result = np.zeros([3,])
def BesselJ0(x):
return math.sqrt(2/math.pi/x)
for i in range(number_of_maximum_loop):
result[0] += BesselJ0(i/100+gs)
result[1] += BesselJ0(i/100+ts)
result[2] += BesselJ0(i/100+bs)
return result
numpy_example(1000,20,20,20)
output:
array([ 160.40546935, 160.40546935, 160.40546935])
I don't know where I am being wrong. I guess I might use reduction. But it seems impossible to finish it with one cuda kernel.