I would like to use the SVD routine from cuSolver (CUDA 7.0). I need to perform the SVD on every block into which I split a matrix (for example, if I divide the matrix into 2x2 blocks, I want to run the four SVDs in parallel). My idea is to invoke the SVD once per block, following the matrix subdivision:
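    for (istart = 0; istart < m; istart += block_rows) {
        for (jstart = 0; jstart < n; jstart += block_cols) {
            // invoke the SVD on the block starting at (istart, jstart)
        }
    }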
But this way the calls are issued serially, not in parallel. Since these cuSolver functions cannot be invoked from inside a kernel, how can I parallelise the calls?
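
To make the subdivision concrete, here is a minimal host-side sketch of how each block in the loop above would be addressed. It assumes the matrix is stored in column-major order with leading dimension `lda`, and that the number of block rows and block columns divides the matrix dimensions evenly. The names `Block` and `enumerate_blocks` are hypothetical helpers for illustration only, not part of cuSolver, and the SVD call itself is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// One sub-block of a column-major matrix (hypothetical helper type).
struct Block {
    std::size_t offset; // index of the block's top-left element in the full array
    int rows;           // number of rows in the block
    int cols;           // number of columns in the block
};

// Enumerate the blocks obtained by splitting an m x n matrix into
// pr x pc tiles. Assumption: pr divides m and pc divides n.
// The loop order (istart outer, jstart inner) mirrors the loop in the question.
std::vector<Block> enumerate_blocks(int m, int n, int lda, int pr, int pc) {
    std::vector<Block> blocks;
    const int block_rows = m / pr;
    const int block_cols = n / pc;
    for (int istart = 0; istart < m; istart += block_rows) {
        for (int jstart = 0; jstart < n; jstart += block_cols) {
            // Column-major: element (i, j) lives at index i + j * lda.
            const std::size_t offset =
                static_cast<std::size_t>(istart) +
                static_cast<std::size_t>(jstart) * static_cast<std::size_t>(lda);
            blocks.push_back({offset, block_rows, block_cols});
            // The per-block SVD would be invoked here on (A + offset),
            // passing the full matrix's lda as the leading dimension.
        }
    }
    return blocks;
}
```

For a 4x4 matrix with `lda = 4` split into 2x2 blocks, `enumerate_blocks(4, 4, 4, 2, 2)` yields four 2x2 blocks, each of which is one independent SVD problem. These are the calls I would like to run concurrently.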