I am looking into potential ways to optimize my kernel code. Mark Harris's blog provides a good example for a 1-D vector. How can I parallelize the code for multi-dimensional data?
For example, I have two rows of data and I want to compute the average value of each row. This pseudocode describes what I want to do:
tensor res({data.size[0]});
for (int i = 0; i < data.size[0]; i++) {
    float tmp = 0.0f;
    for (int j = 0; j < data.size[1]; j++) {
        // accumulate the sum of the i-th row
        tmp += data.at(i, j);
    }
    // average value of the i-th row
    res.at(i) = tmp / float(data.size[1]);
}
For the inner loop, I can easily adapt the methods from the blog to parallelize the execution. Is there any suggestion for optimizing the outer loop, so that I can also parallelize the computation across multiple rows?
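For context, the direction I was considering is sketched below: one thread block per row, with a shared-memory tree reduction inside each block. The kernel name rowMean, the block size, and the row-major flat layout of data are my own assumptions, not something from the blog; it also assumes the block size is a power of two.

// Minimal sketch (assumptions as noted above): one block per row,
// shared-memory reduction within the block.
__global__ void rowMean(const float* data, float* res, int rowLen)
{
    extern __shared__ float sdata[];

    int row = blockIdx.x;   // one block handles one row
    int tid = threadIdx.x;

    // Strided loop over the row so rowLen may exceed blockDim.x.
    float sum = 0.0f;
    for (int j = tid; j < rowLen; j += blockDim.x)
        sum += data[row * rowLen + j];
    sdata[tid] = sum;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the average for this row.
    if (tid == 0)
        res[row] = sdata[0] / float(rowLen);
}

I would launch it with one block per row, e.g. rowMean<<<numRows, 256, 256 * sizeof(float)>>>(d_data, d_res, rowLen);. Is this a reasonable mapping of rows to blocks, or is there a better way to handle the outer loop?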