Concerning CUDA 10.1
I'm doing some calculations on geometric meshes, with a large number of independent calculations per face of the mesh. I run a CUDA kernel that does the calculation for each face.
The calculations involve some matrix multiplication, so I'd like to use cuBLAS or cuBLASLt to speed things up. Since I need to do many matrix multiplications (at least a couple per face), I'd like to do them directly in the kernel. Is this possible?
It doesn't seem like either cuBLAS or cuBLASLt allows calling its functions from kernel (__global__) code. I get the following error from Visual Studio:
"calling a __host__ function from a __device__ function is not allowed"
There are some older answers (e.g. Could a CUDA kernel call a cublas function?) that imply this is possible, though.
Basically, I'd like a kernel like this:
__global__
void calcPerFace(int faceCount, ...)
{
    // Grid-stride loop: each thread handles a subset of the faces
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < faceCount; i += stride)
    {
        // Calculate some matrices for each face in the mesh
        ...
        // Multiply those matrices
        cublasLtMatmul(...); // <- not allowed by cuBLASLt
        // Continue the calculation
        ...
    }
}
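For completeness, I launch the kernel from the host roughly like this (the block size is just what I happen to use, and the real arguments are omitted):

int blockSize = 256;
int numBlocks = (faceCount + blockSize - 1) / blockSize;
calcPerFace<<<numBlocks, blockSize>>>(faceCount /*, device pointers to the mesh data */);
cudaDeviceSynchronize();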
Is it possible to call cublasLtMatmul (or perhaps cublasSgemm) from a kernel like this in CUDA 10.1?
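(If it isn't, I assume the fallback is to hand-roll the multiplications per thread, something along these lines, with 3x3 row-major matrices just as an example size:)

// Hypothetical hand-written fallback: naive per-thread multiply of small row-major matrices
__device__
void matmul3x3(const float* A, const float* B, float* C)
{
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
        {
            float sum = 0.0f;
            for (int k = 0; k < 3; ++k)
                sum += A[r * 3 + k] * B[k * 3 + c];
            C[r * 3 + c] = sum;
        }
}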