I have a C# project in which I retrieve grey-scale images from cameras and do some computation with the image data. The computations are quite time-consuming, since I need to loop over the whole image several times, and everything currently runs on the CPU.

Now I would like to get the evaluation running on the GPU, but I am struggling to achieve that, since I have never done any GPU computation before.

The software should run on several computers with varying hardware, so CUDA, for example, is not an option for me, since the code must also run on laptops that only have onboard graphics. After some research I came across Cloo (found it on this project), which seems to be a quite reasonable choice.

So far I have integrated Cloo into my project and tried to get this hello-world example running. I guess it is running, since I don't get any exception, but I don't know where I can see the printed output.

For my computations I need to pass the image to the GPU, and I also need the x/y coordinates during the computation. In C#, the computation looks like this:

int a = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        a += image[x,y] * x * y;
    }
}

int b = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        b += image[x,y] * (x-a) * y;
    }
}

Now I want these calculations to run on the GPU, and I want to parallelize the y-loop, so that each task runs one x-loop. Then I could take all the resulting partial a values and add them up before the second loop block starts.

Afterwards I would like to return the values a and b to my C# code and use them there.

So, to wrap up my questions:

  1. Is Cloo a recommendable choice for this task?
  2. What is the best way to pass the image data (a 16-bit short array) and the dimensions (img_width, img_height) to the GPU?
  3. How can I return a value from the GPU? As far as I know, kernels are always declared as kernel void...
  4. What would be the best way to implement the loops?

I hope my questions are clear and I provided sufficient information to understand my struggles. Any help is appreciated. Thanks in advance.

  • Two questions: **(A)** Have you already implemented the cited kernels, to at least get hands-on experience with the questioned territory? **(B)** What are the static sizes of the grey-scale 16-bit colour-depth images [x:0,?][y:0,?], and how many times will the "re-processing" mentioned in the motivation of the O/P actually happen? 2x? 20x? 200x? – user3666197 Mar 02 '18 at 10:30
  • So far I have only implemented the hello-world kernel and nothing beyond that. The images have a size of 1920x1200 pixels, and I need two loops over the complete image as in the code example; these two loops are inside another loop that runs around 10 times. – J. Buldt Mar 02 '18 at 14:08

1 Answer

Let's reverse-engineer the problem: understanding the efficient processing of the dependency chain of `image[][]`, `img_height`, `img_width`, `a`, `b`.


Ad 4) The tandem of identical for-loops has poor performance

Given the code as defined, there could be just a single loop, with reduced overhead costs, ideally also maximising cache-aligned, vectorised code.

Cache-Naive re-formulation:

int a = 0;
int c = 1;

for (     int  y = 0; y < img_height; y++ ){
    for ( int  x = 0; x < img_width;  x++ ){
          int      intermediate = image[x,y] * y; // .SET   PROD(i[x,y],y) 
          a += x * intermediate;                  // .REUSE 1st
          c -=     intermediate;                  // .REUSE 2nd
    }
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)

Splitting the code into the tandem loops only increases these overheads and defeats any possible cache-friendly tricks in code-performance tweaking.


Ad 3 + 2) The kernel call-signature + CPU-side methods allow this

OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
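For question 3 specifically: the usual OpenCL pattern is that a void kernel writes its results into an output buffer, which the host side then reads back. A device-code sketch in OpenCL C (buffer names and row-major layout are assumed for illustration, not taken from the O/P's code):

```c
// one work-item per row y; "returns" by writing into the result buffer
__kernel void row_weighted_sum( __global const short *image,
                                const    int         img_width,
                                __global       long  *result )
{
    int  y   = get_global_id( 0 );
    long sum = 0;
    for ( int x = 0; x < img_width; x++ )
        sum += (long)image[y * img_width + x] * x * y;
    result[y] = sum;      // host reads this buffer back after the kernel finishes
}
```

On the Cloo side, the image and result arrays travel through `ComputeBuffer<T>` instances bound as kernel arguments, and the host reads the result buffer back via the command queue once the kernel has been enqueued and finished.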

Yet there are latency costs associated with each host-to-device and device-to-host transfer. Given that the 16-bit 1920x1200 image data are to be re-processed ~10 times in a loop, there are chances that these latencies need not be paid on every such loop pass.

The worst performance-killer is the kernel's very shallow arithmetic density. The problem is that there is simply not much to calculate in the kernel, so the chances for any efficient SIMD / GPU parallel gains are indeed pretty low.

In this sense, smart vectorised CPU-side code will do much better than the (H2D + D2H)-overhead-bound, latency-hostile, computationally shallow GPU-kernel processing.


Ad 1) Given 2 + 3 and 4 above, 1 may easily lose sense

As prototyped, and given additional cache-friendly vectorisation tricks, the in-RAM + in-cache vectorised code has a good chance of beating all OpenCL and mixed GPU/CPU ad-hoc-compiled device code and its computing efforts.

  • Thanks for the detailed answer. One thing I don't get so far: how can I combine the two loops into one, since the result for `b` strongly depends on the result for `a` after the complete loop? In your example you also have an `int c`, which is not used. Also, why not put `a` and `b` into a register too? These are used quite a lot as well. Thanks a lot. – J. Buldt Mar 07 '18 at 07:22
  • Okay, now I got it; I tested the loops and got the same result. After a night with more sleep I will probably understand why it works... Still, one thing is not working: at `register int...` it doesn't know the keyword `register`. Do I need to do something else to use `register`? – J. Buldt Mar 07 '18 at 10:36
  • Following this answer, `register` seems to work only in C++. In C# the assignment of registers is done by the compiler. https://stackoverflow.com/a/26186293/9228257 – J. Buldt Mar 07 '18 at 12:52
  • **[`register`]** Sorry, an old habit of setting a compiler directive for the fastest placement of a new variable into a CPU register; kindly skip it, that was my fault. For a next level, one may get inspired by Bloomberg's IT research into memory allocators faster than the default ones, but this goes way beyond the sketched solution approach. **[Why it works]** The processing is just better aligned and (re-)uses the algebraic properties of the dependency chain. This dependency also shows why there is not much room for a SIMD / GPU-type re-formulation of the problem, as demonstrated above. – user3666197 Mar 07 '18 at 13:03
  • I just implemented the combination of the loops in my code. The calculations now take only 20 ms instead of 120 ms, so it helped quite a lot. Thanks for the help. – J. Buldt Mar 07 '18 at 15:52