So I have a bunch of arrays A_i, each of size M x N_i, which are entirely independent and don't need to communicate with each other.
I've designed a kernel that I want to run on each array separately. However, since multiple kernels can't run at the same time, I would like to design a single kernel that operates either on all of these arrays at once or on batches of them at a time.
Since these arrays all have the same number of rows, I could column-stack them into a single M x (N_1 + N_2 + ...) mega-array and then use a single kernel, with the appropriate offsets, to operate on each portion of this mega-array separately. However, I'm looking for a cleaner solution, especially since the number of work-groups and work-items used for each A_i depends on its column dimension N_i.
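To make the mega-array idea concrete, here is a rough sketch of the offset bookkeeping it would need. It is plain Python rather than OpenCL, and everything in it is an assumption for illustration: the helper names, the choice of a fixed number of columns per work-group (cols_per_group), and the idea of giving each work-group a flat ID that is mapped back to an array via prefix sums.

```python
from bisect import bisect_right
from itertools import accumulate


def build_col_offsets(col_counts):
    """Prefix sums of N_i: col_offsets[i] is the first mega-array column of A_i."""
    return [0] + list(accumulate(col_counts))


def build_group_offsets(col_counts, cols_per_group):
    """Prefix sums of per-array work-group counts.

    Assumes each work-group handles up to cols_per_group columns,
    so A_i needs ceil(N_i / cols_per_group) work-groups.
    """
    groups = [-(-n // cols_per_group) for n in col_counts]  # ceiling division
    return [0] + list(accumulate(groups))


def locate(group_id, group_offsets):
    """Map a flat work-group ID to (array index, work-group index within that array)."""
    if not 0 <= group_id < group_offsets[-1]:
        raise IndexError("group_id out of range")
    i = bisect_right(group_offsets, group_id) - 1
    return i, group_id - group_offsets[i]


def first_column(group_id, col_offsets, group_offsets, cols_per_group):
    """First mega-array column handled by the given work-group."""
    i, local_group = locate(group_id, group_offsets)
    return col_offsets[i] + local_group * cols_per_group


# Hypothetical example: three arrays with N = 5, 3, 8 and 4 columns per work-group.
col_counts = [5, 3, 8]
col_offsets = build_col_offsets(col_counts)          # [0, 5, 8, 16]
group_offsets = build_group_offsets(col_counts, 4)   # [0, 2, 3, 5]
print(locate(4, group_offsets))                      # (2, 1)
print(first_column(4, col_offsets, group_offsets, 4))  # 12
```

In a real kernel the two offset tables would presumably be passed in as small read-only buffers, and each work-group would still have to check its columns against col_offsets[i + 1], since the last work-group of an array can be only partly full. This is exactly the bookkeeping I was hoping to avoid.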
Hopefully this explanation is clear.