My goal is simple: I need to calculate the sum of all elements in every column of a 2D matrix of known size.
I have already completed the first part of the algorithm, which builds the 2D matrix in global memory and fills it with floats. Since the 2D matrix is enormous (~800 million floats), I think the best approach is to compute the column sums in the same kernel so that there is no extra device->host and host->device transfer delay.
If I understand correctly, the best solution is to return a 1D vector whose size is the number of columns, with each entry holding the sum of the corresponding column.
If the above is true, can someone recommend a way to successfully implement this? Thanks in advance.
Limitations: I am only running about 5000 threads. The number of columns is ~160,000, while the number of rows is ~5000.
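To make the question concrete, here is a minimal sketch of the kind of column sum I have in mind. It is not a tuned implementation. It assumes the matrix is stored row-major (element (r, c) at index r * numCols + c), which the description above does not actually state. The names `columnSums`, `runColumnSumsDemo`, `d_matrix`, and `d_sums` are hypothetical, and the tiny sizes and the host-to-device copy exist only so the demo is self-contained. Each thread sums whole columns, and a grid-stride loop lets roughly 5000 threads cover all ~160,000 columns.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread sums entire columns. The grid-stride loop lets a fixed number
// of threads cover any number of columns.
// ASSUMPTION: row-major layout, element (r, c) at matrix[r * numCols + c].
// With this layout, neighbouring threads read neighbouring addresses.
__global__ void columnSums(const float* matrix, float* sums,
                           size_t numRows, size_t numCols)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t c = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         c < numCols; c += stride) {
        float acc = 0.0f;
        for (size_t r = 0; r < numRows; ++r)
            acc += matrix[r * numCols + c];
        sums[c] = acc;  // one output entry per column
    }
}

// Hypothetical demo driver: builds a small matrix, runs the kernel,
// and compares every column sum against the known value.
bool runColumnSumsDemo()
{
    // Small illustrative sizes; the real matrix is ~5000 x ~160,000.
    const size_t numRows = 4, numCols = 1000;
    std::vector<float> h_matrix(numRows * numCols);
    for (size_t r = 0; r < numRows; ++r)
        for (size_t c = 0; c < numCols; ++c)
            h_matrix[r * numCols + c] = (float)(c % 7);  // column c sums to numRows * (c % 7)

    float* d_matrix = nullptr;
    float* d_sums = nullptr;
    cudaMalloc(&d_matrix, h_matrix.size() * sizeof(float));
    cudaMalloc(&d_sums, numCols * sizeof(float));
    // Demo only: in the real code the matrix is already in global memory.
    cudaMemcpy(d_matrix, h_matrix.data(), h_matrix.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    columnSums<<<20, 256>>>(d_matrix, d_sums, numRows, numCols);  // 5120 threads

    // Only the small result vector is copied back to the host.
    std::vector<float> h_sums(numCols);
    cudaMemcpy(h_sums.data(), d_sums, numCols * sizeof(float),
               cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (size_t c = 0; c < numCols; ++c)
        if (h_sums[c] != (float)(numRows * (c % 7))) ok = false;

    cudaFree(d_matrix);
    cudaFree(d_sums);
    return ok;
}

int main()
{
    bool ok = runColumnSumsDemo();
    printf("%s\n", ok ? "column sums match" : "mismatch");
    return ok ? 0 : 1;
}
```

If the matrix is actually stored column-major, the inner index would become matrix[c * numRows + r], and this one-thread-per-column pattern would probably no longer give coalesced reads, so a per-column block reduction might be the better fit. Summing ~5000 floats in a single float accumulator may also lose some precision.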