Advice on CUDA algorithm to sum columns of a matrix

Question

My goal is is simple. I need to calculate the sum of all elements in every column of a 2D matrix of known size as such:

enter image description here

I already completed the first part of the algorithm which successfully builds the 2D matrix in global memory, filling it with floats. Since, the 2D matrix is enormous (~800 million floats), I think the best approach is to implement the column sums in the same kernel so that there is no extra device-> host and host->device transfer delay.

If I understand correctly, the best solution is to return a 1D vector of size #rows with each entry representing the corresponding sum of the column.

If the above is true, can someone recommend a way to successfully implement this? Thanks in advance.

Limitations: Only running about 5000 threads. the number of columns is ~ 160,000 while the number of rows is ~5000.

Summing the columns in column-major ordering algorithmically is equivalent to summing the rows in row-major ordering. This typically implies a reduction per column, unless you can work equally well with transposed data, in which case we can write a very efficient kernel. This question has been asked many times, such as [here](http://stackoverflow.com/questions/17862078/cuda-code-for-sum-of-rows-of-a-matrix-too-slow/17864583#17864583) You can also search the cuda tag e.g. "sum rows columns" for more examples. If none apply, you might point out how you think this question is unique. — Robert Crovella, Jul 11 '14 at 18:29
Since you are constructing the matrix yourself, if you can construct it with row-major ordering instead, the kernel can become quite simple and also highly efficient for this task, such as [here](http://stackoverflow.com/questions/18930558/summing-the-rows-of-a-matrix-stored-in-either-row-major-or-column-major-order). — Robert Crovella, Jul 11 '14 at 18:39
I was hoping there would be some specialized CUDA function I could call within the kernel that was more efficient than a serialized, iterated sum. Does such a thing exist? @robert — Jordan, Jul 11 '14 at 18:49
Also, when people say "reduction" are they talking about serialized, iterated sums? — Jordan, Jul 11 '14 at 18:50
A "reduction" is any algorithm that produces a relatively small output data set (perhaps only one value, such as a sum) from a relatively large input data set. However, in the context of CUDA GPU programming, when I say "reduction" I am often referring to a [parallel reduction](http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction) (note the pdf file as well). Both concepts are applicable to my usage above. It is a reduction per column, and in the case of summing columns in column-major order, a *parallel reduction* algorithm may be of interest. — Robert Crovella, Jul 11 '14 at 18:54
A serialized, iterated sum per row (in column-major storage) or per column (in row-major storage) is pretty much the best algorithm realization for this type of task, assuming the *number of rows* (case 1) or the *number of columns* (case 2) is of reasonably large size. 160,000 columns to use your number is definitely large enough. Your task can be considerably simplified *and* optimized if you can switch underlying storage to row-major (for summing columns). — Robert Crovella, Jul 11 '14 at 18:58
Yes, the ease of establishing data (read) coalescing is a major reason why I suggest that summing columns in row-major order, or summing rows in column-major order, is best. Parallel reduction algorithms can also load data in a coalesced fashion (that's why we use them as opposed to an iterated sum depending on the underlying data organization), but they do not result in a trivially-simple iterated sum realization, and have a number of computational and communication overheads associated with them, that a trivial iterated sum does not. Summing columns in row-major storage allows both. — Robert Crovella, Jul 11 '14 at 19:28

Advice on CUDA algorithm to sum columns of a matrix

0 Answers0