
I am looking into potential optimizations for my kernel code. Mark Harris's blog gives a good example of parallel reduction for a 1-D vector. How can I parallelize the code for multi-dimensional data?

For example, I have two rows of data, and I want the average value of each row. This pseudocode describes what I want to do:

tensor res({data.size[0]});

for (int i = 0; i < data.size[0]; i++) {
    float tmp = 0.0f;
    for (int j = 0; j < data.size[1]; j++) {
        // accumulate the sum of the i-th row
        tmp += data.at(i, j);
    }
    // average value for the i-th row
    res.at(i) = tmp / float(data.size[1]);
}

For the inner loop, I can easily adapt those methods to parallelize the execution. Are there any suggestions for optimizing the outer loop, so that the computation for multiple rows is parallelized as well?

  • What's the value of size[0] and size[1]? – Abator Abetor Feb 04 '22 at 19:41
  • 1
    You can launch the rows at the same time, effectively parallelizing them. This is effectively a segmented reduction, and libraries like thrust and cub can do this for you, or you can find various posts here on the `cuda` tag that explain how to do it from first principles. [Here](https://stackoverflow.com/questions/18930558/summing-the-rows-of-a-matrix-stored-in-either-row-major-or-column-major-order) is one example, there are many others. – Robert Crovella Feb 04 '22 at 20:17
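The block-per-row idea from the comment above can be sketched as a CUDA kernel: launch one thread block per row, have each block sum its row with a grid-stride loop over the columns, and then finish with the same shared-memory tree reduction used in Mark Harris's 1-D example. This is only a sketch under stated assumptions — `row_mean` and the dense row-major layout are illustrative, not from any particular library, and the tree reduction assumes the block size is a power of two.

```cuda
// Sketch: one block per row of a dense row-major M x N float matrix.
// Assumes blockDim.x is a power of two (e.g. 256).
__global__ void row_mean(const float *data, float *res, int ncols)
{
    extern __shared__ float sdata[];
    int row = blockIdx.x;   // each block handles exactly one row
    int tid = threadIdx.x;

    // Grid-stride loop over the columns, so ncols may exceed blockDim.x.
    float sum = 0.0f;
    for (int j = tid; j < ncols; j += blockDim.x)
        sum += data[row * ncols + j];
    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the row average.
    if (tid == 0)
        res[row] = sdata[0] / (float)ncols;
}

// Launch: one block per row, shared memory sized to the block:
//   row_mean<<<nrows, 256, 256 * sizeof(float)>>>(d_data, d_res, ncols);
```

Because the grid dimension covers the rows and the block covers the columns, both loops of the original pseudocode run in parallel at once. For production use, the segmented-reduction primitives in thrust (`reduce_by_key`) or cub (`DeviceSegmentedReduce`) mentioned in the comment handle the same pattern without hand-written kernels.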

0 Answers