9

I have an application where I need to take the average intensity of an image, for around 1 million images. It "feels" like a job for a GPU fragment shader, but fragment shaders are designed for per-pixel local computations, while image averaging is a global operation.

One approach I considered is loading the image into a texture, applying a 2x2 box blur, loading the result back into an N/2 x N/2 texture, and repeating until the output is 1x1. However, this would take log n applications of the shader.
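As a sanity check, this pyramid of 2x2 averages reproduces the global mean exactly, up to floating-point error. Here is a plain-Python CPU sketch of the idea (my own illustration, assuming a square power-of-two image stored as a list of lists of floats):

```python
import random

# Repeatedly average 2x2 blocks until a single value remains.
# Assumes the image is square with a power-of-two side length.
def reduce_average(img):
    n = len(img)
    while n > 1:
        half = n // 2
        img = [[(img[2*r][2*c] + img[2*r][2*c+1] +
                 img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4.0
                for c in range(half)]
               for r in range(half)]
        n = half
    return img[0][0]

random.seed(0)
img = [[random.random() for _ in range(8)] for _ in range(8)]
direct_mean = sum(map(sum, img)) / 64.0
assert abs(reduce_average(img) - direct_mean) < 1e-12
```

Each shader pass would compute one level of this pyramid, with all the 2x2 averages within a level running in parallel.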

Is there a way to do it in one pass? Or should I just break down and use CUDA/OpenCL?

Kyle Simek
  • My application is doing Chamfer matching of a projected 3D model to an input image. I render an image containing the silhouette edges of my model, and for every edge pixel, I use a lookup table to find the nearest edge pixel in the input image. Then I need the average result, which tells me how well the model fits the data. I tried reading the rendered edge pixels from OpenGL and doing Chamfer matching on the CPU, but the read operation was a big bottleneck. I was hoping that by doing the whole thing on the GPU and just reading back a single value, I'd get a big speedup. – Kyle Simek Sep 16 '10 at 02:18
  • (ctd) Since I can pass the lookup table as a texture, I can do the lookups in a vertex shader, but I still have the bottleneck of reading the data back into main memory. – Kyle Simek Sep 16 '10 at 02:25
  • Nothing forces you to use a 2x2 blur: you could, for example, blur with a 16x16 box and then load the result into an N/16 x N/16 texture. That way you may achieve a big speed-up with fewer copy operations... – Agnius Vasiliauskas Sep 17 '10 at 09:08
  • @0x69 I thought of this, but in my experience, using large blurring kernels like this actually loses performance, as the neighbor lookups occur serially. Consider the extreme case, where I use a 512x512 box that covers my whole image. Now I've done the entire image-averaging operation *serially* in my shader, which totally defeats the parallel nature of the GPU's fragment shader. – Kyle Simek Sep 17 '10 at 12:49
  • As always, the truth is somewhere in the middle :) The nature of this task is such that a 2x2 kernel is too small for optimal performance, but of course a 512x512 kernel is too large... You should experiment to find the most efficient kernel size... – Agnius Vasiliauskas Sep 17 '10 at 16:35
  • @0x69 Often that's true, but I think in this case 2x2 is optimal, at least in terms of minimizing sequential operations. Reading the neighborhood happens sequentially, so every iteration requires M^2 sequential operations, where M is the neighborhood size. For an N x N image, the total number of sequential operations is O(M^2 log_M N). For example, for a 512 x 512 image, a 2x2 neighborhood takes only 36 sequential operations, while a 16x16 neighborhood takes 576. – Kyle Simek Sep 20 '10 at 18:15
  • 1
    We also must take texture copy operations into account. For example, for a 4096x4096 image, a 2x2 kernel takes 12 memory copy operations, but a 16x16 kernel takes just 3. That is why I said one needs to check experimentally which kernel size is optimal, because minimizing sequential operations increases the total number of memory copy operations, which can amortize the speed-up. – Agnius Vasiliauskas Sep 21 '10 at 06:18
  • See similar question here-
    http://stackoverflow.com/questions/2944290/efficient-pixel-shader-sum-of-all-pixels/3616269#3616269
    – Agnius Vasiliauskas Sep 17 '10 at 05:42
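The kernel-size trade-off debated in these comments can be made concrete. The sketch below (my own illustration, using the commenters' own formulas) computes the number of passes/texture copies, ceil(log_M N), and the sequential-operation count, M^2 log_M N:

```python
import math

# Number of shader passes (and texture copy operations) needed to
# reduce an n x n image with an m x m averaging kernel.
def copies(n, m):
    return math.ceil(math.log2(n) / math.log2(m))

# Each pass reads an m x m neighborhood serially in the fragment
# shader, so the serial work is m^2 per pass times log_m(n) passes.
def sequential_ops(n, m):
    return m * m * math.log2(n) / math.log2(m)

assert copies(4096, 2) == 12 and copies(4096, 16) == 3
assert sequential_ops(512, 2) == 36    # small kernel: few serial reads
assert sequential_ops(512, 16) == 576  # large kernel: many serial reads
```

Small kernels minimize serial reads per pass but maximize the number of passes and copies; that is exactly the tension being discussed.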

3 Answers

5

The summation operation is a specific case of the "reduction," a standard operation in CUDA and OpenCL libraries. A nice writeup on it is available on the CUDA demos page. In CUDA, Thrust and CUDPP are just two examples of libraries that provide reduction. I'm less familiar with OpenCL, but CLPP seems to be a good library that provides reduction. Just copy your color buffer to an OpenGL pixel buffer object and use the appropriate OpenGL interoperability call to make that pixel buffer's memory accessible in CUDA/OpenCL.
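The reduction pattern itself is simple: sum pairs repeatedly, halving the array each step, so n values collapse in about log2(n) parallel steps. Here is a plain-Python stand-in for what a library reduction like Thrust's would do on the device (a sketch of the pattern, not the library's actual implementation):

```python
# CPU sketch of a tree reduction: each step replaces element i with
# data[2*i] + data[2*i+1], halving the array. On a GPU, every
# addition within a step would run in parallel.
def tree_reduce(data):
    data = list(data)
    while len(data) > 1:
        if len(data) % 2:          # pad odd lengths with the identity
            data.append(0.0)
        data = [data[2*i] + data[2*i + 1] for i in range(len(data) // 2)]
    return data[0]

values = [float(i) for i in range(1, 101)]
assert tree_reduce(values) == sum(values)   # 5050.0
```

Dividing the resulting sum by the pixel count then gives the average intensity.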

If it must be done using the OpenGL API (as the original question required), the solution is to render to a texture, create a mipmap of the texture, and read in the 1x1 texture. You have to set the filtering right (bilinear is appropriate, I think), but it should get close to the right answer, modulo precision error.

Kyle Simek
1

My gut tells me to attempt your implementation in OpenCL. You can optimize for your image size and graphics hardware by breaking up the images into bespoke chunks of data that are then summed in parallel. Could be very fast indeed.

Fragment shaders are great for convolutions, but there the result is written to gl_FragColor, so it's a natural fit. Here, you would ultimately have to loop over every pixel in the texture, sum the results, and then read the total back in the main program. Generating image statistics is perhaps not what the fragment shader was designed for, and it's not clear that a major performance gain is to be had, since it's not guaranteed that a particular buffer is located in GPU memory.

It sounds like you may be applying this algorithm to a real-time motion detection scenario, or some other automated feature detection application. It may be faster to compute some statistics from a sample of pixels rather than the entire image and then build a machine learning classifier.

Best of luck to you in any case!

1

You don't need CUDA if you'd like to stick to GLSL. As in the CUDA solution mentioned here, it can be done straightforwardly in a fragment shader. However, you need about log(resolution) draw calls. Just set up a shader that takes 2x2 pixel samples from the original image and outputs their average. The result is an image with half the resolution in both axes. Repeat that until the image is 1x1 px. Some considerations:

  • Use GL_FLOAT luminance textures if available, to get a more precise sum.
  • Use glViewport to quarter the rendering area in each stage. The result then ends up in the top-left pixel of your framebuffer.
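To see why the GL_FLOAT suggestion matters, here is a plain-Python sketch (my own illustration) comparing the reduction with float intermediates against one that rounds every intermediate level to 8 bits, as a GL_UNSIGNED_BYTE render target would:

```python
import random

# Average 2x2 blocks down to 1x1; optionally round each intermediate
# level to integers in 0..255, mimicking an 8-bit texture target.
def pyramid_mean(img, eight_bit):
    n = len(img)
    while n > 1:
        half = n // 2
        img = [[(img[2*r][2*c] + img[2*r][2*c+1] +
                 img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4.0
                for c in range(half)]
               for r in range(half)]
        if eight_bit:
            img = [[float(round(v)) for v in row] for row in img]
        n = half
    return img[0][0]

random.seed(1)
img = [[float(random.randrange(256)) for _ in range(64)] for _ in range(64)]
exact = sum(map(sum, img)) / (64 * 64)
assert abs(pyramid_mean(img, False) - exact) < 1e-9  # float: essentially exact
assert abs(pyramid_mean(img, True) - exact) <= 3.0   # 8-bit: up to 0.5 error per level
```

With 8-bit intermediates, each of the log2(n) levels can introduce up to half a grey level of rounding error, while float intermediates keep the result exact for all practical purposes.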

dronus