0

My goal is is simple. I need to calculate the sum of all elements in every column of a 2D matrix of known size as such:

enter image description here

I already completed the first part of the algorithm which successfully builds the 2D matrix in global memory, filling it with floats. Since, the 2D matrix is enormous (~800 million floats), I think the best approach is to implement the column sums in the same kernel so that there is no extra device-> host and host->device transfer delay.

If I understand correctly, the best solution is to return a 1D vector of size #rows with each entry representing the corresponding sum of the column.

If the above is true, can someone recommend a way to successfully implement this? Thanks in advance.

Limitations: Only running about 5000 threads. the number of columns is ~ 160,000 while the number of rows is ~5000.

Jordan
  • 305
  • 3
  • 13
  • 1
    Column or row major ordering? – talonmies Jul 11 '14 at 17:57
  • column major ordering. – Jordan Jul 11 '14 at 17:59
  • 1
    Summing the columns in column-major ordering algorithmically is equivalent to summing the rows in row-major ordering. This typically implies a reduction per column, unless you can work equally well with transposed data, in which case we can write a very efficient kernel. This question has been asked many times, such as [here](http://stackoverflow.com/questions/17862078/cuda-code-for-sum-of-rows-of-a-matrix-too-slow/17864583#17864583) You can also search the cuda tag e.g. "sum rows columns" for more examples. If none apply, you might point out how you think this question is unique. – Robert Crovella Jul 11 '14 at 18:29
  • Since you are constructing the matrix yourself, if you can construct it with row-major ordering instead, the kernel can become quite simple and also highly efficient for this task, such as [here](http://stackoverflow.com/questions/18930558/summing-the-rows-of-a-matrix-stored-in-either-row-major-or-column-major-order). – Robert Crovella Jul 11 '14 at 18:39
  • I was hoping there would be some specialized CUDA function I could call within the kernel that was more efficient than a serialized, iterated sum. Does such a thing exist? @robert – Jordan Jul 11 '14 at 18:49
  • Also, when people say "reduction" are they talking about serialized, iterated sums? – Jordan Jul 11 '14 at 18:50
  • 1
    A "reduction" is any algorithm that produces a relatively small output data set (perhaps only one value, such as a sum) from a relatively large input data set. However, in the context of CUDA GPU programming, when I say "reduction" I am often referring to a [parallel reduction](http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction) (note the pdf file as well). Both concepts are applicable to my usage above. It is a reduction per column, and in the case of summing columns in column-major order, a *parallel reduction* algorithm may be of interest. – Robert Crovella Jul 11 '14 at 18:54
  • 1
    A serialized, iterated sum per row (in column-major storage) or per column (in row-major storage) is pretty much the best algorithm realization for this type of task, assuming the *number of rows* (case 1) or the *number of columns* (case 2) is of reasonably large size. 160,000 columns to use your number is definitely large enough. Your task can be considerably simplified *and* optimized if you can switch underlying storage to row-major (for summing columns). – Robert Crovella Jul 11 '14 at 18:58
  • Is this because of data coalescing? – Jordan Jul 11 '14 at 19:24
  • 1
    Yes, the ease of establishing data (read) coalescing is a major reason why I suggest that summing columns in row-major order, or summing rows in column-major order, is best. Parallel reduction algorithms can also load data in a coalesced fashion (that's why we use them as opposed to an iterated sum depending on the underlying data organization), but they do not result in a trivially-simple iterated sum realization, and have a number of computational and communication overheads associated with them, that a trivial iterated sum does not. Summing columns in row-major storage allows both. – Robert Crovella Jul 11 '14 at 19:28

0 Answers0