I'm trying to parallelize a function with CUDA; it is called many times, and each call works with the same matrix. I want to store this matrix in GPU memory once, and then, whenever the function is called, upload a vector to the GPU, multiply the matrix by it, and return the result. I prefer a C++ template style, so Thrust has higher priority.
Please recommend some functions for this and, if possible, a few small illustrating samples. I'm not providing my code because of its complexity and huge size, not because it's a secret.
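For context, here is a minimal sketch of the structure I have in mind. It uses cuBLAS (`cublasSgemv`) rather than Thrust, since as far as I know Thrust has no built-in matrix-vector product; the class name `GpuMatVec` and the overall layout are just my own illustration, and I've left out error checking:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

// Persistent state: the matrix is uploaded once and stays resident on the device.
struct GpuMatVec {
    cublasHandle_t handle;
    float* d_A;   // m x n matrix, column-major (cuBLAS convention)
    int m, n;

    GpuMatVec(const std::vector<float>& A_colmajor, int m_, int n_)
        : m(m_), n(n_) {
        cublasCreate(&handle);
        cudaMalloc(&d_A, sizeof(float) * m * n);
        cudaMemcpy(d_A, A_colmajor.data(), sizeof(float) * m * n,
                   cudaMemcpyHostToDevice);
    }

    // Called many times: upload x, compute y = A * x on the GPU, download y.
    std::vector<float> multiply(const std::vector<float>& x) {
        float *d_x, *d_y;
        cudaMalloc(&d_x, sizeof(float) * n);
        cudaMalloc(&d_y, sizeof(float) * m);
        cudaMemcpy(d_x, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

        const float alpha = 1.0f, beta = 0.0f;
        // y = alpha * A * x + beta * y
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m,
                    d_x, 1, &beta, d_y, 1);

        std::vector<float> y(m);
        cudaMemcpy(y.data(), d_y, sizeof(float) * m, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
        return y;
    }

    ~GpuMatVec() {
        cudaFree(d_A);
        cublasDestroy(handle);
    }
};
```

Is this roughly the right approach, or is there a more idiomatic Thrust-style way (e.g. wrapping the vectors in `thrust::device_vector` and passing raw pointers to cuBLAS)?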