
I'm trying to implement a multichannel integral image algorithm, but it's too slow: 8 seconds for 200 images (640×480) on a Core 2 Quad. I expect it to reach 1 second for 200 images.

This is the profiling result (over 200 images, `n_bins` = 8):

[Image: profiling result of the multichannel integral image code]

How can I optimize `*ps = *psu + s`?

Plain text version of code
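
The linked code is not reproduced on this page; judging from the snippet quoted in the answers below, the hot loop has roughly this shape. This is a sketch only: the surrounding loops, the buffer names `bin_image` and `integral`, and the exact layout are assumptions, while `pg`, `ps`, `psu`, `s`, and `n_bins` come from the quoted snippet.

// Rough reconstruction of the hot loop, based on the snippet quoted in the
// answers below. Assumed layout: integral is (h+1) x (w+1) x n_bins with the
// channel innermost, so stepping in x advances the pointers by n_bins.
// Row 0 and column 0 of the integral image are assumed zero-initialized.
for (int y = 1; y < h + 1; y++) {
    for (int b = 0; b < n_bins; b++) {
        const unsigned char *pg = bin_image[b] + (y - 1) * w;             // source row
        int *ps        = integral + (y * (w + 1) + 1) * n_bins + b;       // row y, x = 1
        const int *psu = integral + ((y - 1) * (w + 1) + 1) * n_bins + b; // row above
        int s = 0;                                    // running sum along the row
        for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
            s += *pg;
            *ps = *psu + s;                           // left sum + integral above
        }
    }
}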

whenov
  • Was it compiled with the highest optimization settings? – Tyler Apr 15 '14 at 03:04
  • @Tyler Yes, it was compiled with the /O2 option (Maximize Speed). – whenov Apr 15 '14 at 03:08
  • Can you post the assembly? I'm wondering how much the compiler did to vectorize it. – Tyler Apr 15 '14 at 03:12
  • @Tyler http://pastebin.com/m66rtk5S, thanks. – whenov Apr 15 '14 at 03:22
  • @Tyler Sorry, I just changed one line of the code; here is the original version: http://pastebin.com/dP2t2AAY – whenov Apr 15 '14 at 03:27
  • It's not vectorizing at all. I suspect the main reason it is so slow is that it is memory bound: it has to load each byte from memory separately. It can't vectorize because the memory accesses aren't linear. Is there any way you could rearrange the code to make `ps` and `psu` increment by 1 instead of `n_bins`? – Tyler Apr 15 '14 at 03:30
  • @Tyler Thanks for your explanation! I would like to use sse to compute sum of multi-channel region in next step, so the channel dimension must be the innermost dimension for continuous memory. – whenov Apr 15 '14 at 03:47
  • I wish I could say I knew what that meant, but I can't; I've never done anything with image processing. Is that a yes or a no? If it's a yes, then you should rewrite it as I described and post the new assembly (if the compiler doesn't vectorize, I'll explain how to do it by hand). If no, then the algorithm is as fast as it's going to get. – Tyler Apr 15 '14 at 03:53
  • @Tyler Sorry for my poor English writing; it means no. There's a much more frequent process which must be optimized by SIMD, and that SIMD optimization also relies on the continuity of memory. So I reshaped the matrix dimensions from (width, height, channel) to (channel, width, height) in the integral-image pass, which caused the `n_bins` increment step (the two layouts are sketched after these comments). – whenov Apr 15 '14 at 04:07
  • @Tyler: the compiler will not vectorize this because of the prefix sum, which induces a strong data dependency. –  Apr 15 '14 at 09:47
  • This speed is indeed unexpected. You have 60 MB in and 240 MB out, much less than a processor can handle in a second. But you don't say how many channels you have, nor which architecture! –  Apr 15 '14 at 09:57
  • Yves Daoust, maybe vectorize isn't the right word, but what I meant is to load multiple bytes at once and work with the registers primarily. If the memory accesses were contiguous, it could load 16 contiguous bytes at once into an SSE reg and loop over bytes in the SSE reg. Storing the resultant bytes in another SSE reg and then writing that out as the result. I've used a technique similar to this to implement strlen. – Tyler Apr 15 '14 at 13:55
  • @YvesDaoust I have 8 channels, on a Core 2 Quad processor. – whenov Apr 15 '14 at 14:35
  • 1
    @Tyler: SSE does not support horizontal prefix sums. Emulating it would provide no speedup. It is possible to implement vertical prefix sums efficiently. In the case of integral images only part of the algorithm can be vectorized, and requires transformations that a compiler is not able to invent. –  Apr 15 '14 at 18:57
  • You should have a look at how your algorithm fits with caches, and possibly reorder the memory accesses to improve locality. –  Apr 15 '14 at 19:08
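
To make the layout trade-off in the comments above concrete, here is a small sketch of the two indexing schemes. The helper names are hypothetical; `n_bins`, `w`, and `h` are as in the question.

// Planar layout: each channel is a contiguous plane, so a row scan is
// stride-1 and 16 contiguous bytes can be loaded at once, as the comments
// above suggest for vectorizing the row scan.
inline int planar_index(int b, int x, int y, int w, int h) {
    return (b * h + y) * w + x;          // stepping in x advances by 1
}
// Interleaved layout, channel innermost, as the question uses: all n_bins
// values of one pixel are adjacent (good for the later per-pixel SIMD sums),
// but a row scan steps by n_bins, hence ps += n_bins in the hot loop.
inline int interleaved_index(int b, int x, int y, int w, int n_bins) {
    return (y * w + x) * n_bins + b;     // stepping in x advances by n_bins
}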

3 Answers


Start by checking the compiler settings: are they set for maximum performance?

Then, depending on the architecture, the calculation of an integral image has several possible bottlenecks.

  1. The computation itself: some low-cost CPUs can't perform integer math with good performance. No solution for this.

  2. The data flow is not optimal. The solution is to provide optimal data flows (number of sequential read and write streams). For example, you can process 2 rows simultaneously (see the sketch after this list).

  3. Data dependency in the algorithm. On a modern CPU this can be the biggest problem. The solution is to change the processing algorithm, for example computing odd/even pixels without the dependency (more calculations, less dependency).

  4. Processing can be done on a GPU.
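
To illustrate points 2 and 3: the integral image can be computed in two passes, a horizontal pass of per-row prefix sums (rows are mutually independent) followed by a vertical accumulation down each column. A minimal sketch of the horizontal pass, with hypothetical names, processing two rows per iteration:

// Horizontal pass of a two-pass integral image: per-row prefix sums.
// Rows are independent, so processing two at once gives two independent
// dependency chains (point 3) and two sequential read/write streams (point 2).
// A second, vertical pass (out[y][x] += out[y-1][x]) completes the integral.
void row_prefix_pass(const unsigned char *src, int *out, int w, int h) {
    int y = 0;
    for (; y + 1 < h; y += 2) {
        const unsigned char *r0 = src + y * w, *r1 = src + (y + 1) * w;
        int *o0 = out + y * w, *o1 = out + (y + 1) * w;
        int s0 = 0, s1 = 0;                 // two independent running sums
        for (int x = 0; x < w; x++) {
            s0 += r0[x]; o0[x] = s0;
            s1 += r1[x]; o1[x] = s1;
        }
    }
    if (y < h) {                            // odd number of rows: do the last alone
        const unsigned char *r = src + y * w;
        int *o = out + y * w, s = 0;
        for (int x = 0; x < w; x++) { s += r[x]; o[x] = s; }
    }
}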

minorlogic

I have trouble believing that profile result. In this code

16      for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
17          s += *pg;
18          *ps = *psu + s;
19      }

it says the lion's share of time is on line 18, very little on 17, and next to nothing on line 16. Yet the loop is also doing a comparison, two increments, and three adds on every iteration. Cache misses might explain it, but there's no harm in double-checking, which I do with this technique.

Regardless, the loop could be unrolled, for example:

int x = w;
// Process four pixels per iteration; once s is updated, the four stores
// are independent, which gives the CPU more work to overlap.
while(x >= 4){

  s += pg[0];
  ps[n_bins*0] = psu[n_bins*0] + s;

  s += pg[1];
  ps[n_bins*1] = psu[n_bins*1] + s;

  s += pg[2];
  ps[n_bins*2] = psu[n_bins*2] + s;

  s += pg[3];
  ps[n_bins*3] = psu[n_bins*3] + s;

  x -= 4;
  pg += 4;
  ps += n_bins*4;
  psu += n_bins*4;
}
// Handle the remaining 0-3 pixels.
for(; --x >= 0;){
  s += *pg;
  *ps = *psu + s;
  pg++;
  ps += n_bins;
  psu += n_bins;
}

If `n_bins` happens to be a constant, this could enable the compiler to do some more optimizing of the code in the while loop.
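
One way to hand the compiler that constant, sketched here with a hypothetical template parameter, is to compile the loop once per known channel count, so every `n_bins*k` offset becomes an immediate:

// Sketch: fix n_bins at compile time so the n_bins*k offsets above become
// constant displacements in the generated addressing modes.
template <int N_BINS>
void scan_row(const unsigned char *pg, int *ps, const int *psu, int w) {
    int s = 0;
    for (int x = 0; x < w; x++, pg++, ps += N_BINS, psu += N_BINS) {
        s += *pg;
        *ps = *psu + s;
    }
}
// Dispatch once per image on the runtime value:
//   if (n_bins == 8) scan_row<8>(pg, ps, psu, w);
//   else             scan_row_generic(pg, ps, psu, w, n_bins);  // hypothetical fallback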

Mike Dunlavey

You probably don't compute integral images just for the sake of computing integral images.

I imagine two situations:

1) you use the integral images on every pixel to compute a box filter or similar.

2) you use them at a much smaller number of places.

In case 1), the computation of the integral images will probably not be the bottleneck in your application.

In case 2), you should wonder whether it is worth computing the entire integral images.

This said, parallelization with four threads is also an option. The easiest approach is to let every thread compute every fourth image (a sketch follows).
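
A minimal sketch of that scheme with `std::thread`; the `Image` type and the `compute_integral` routine are placeholders for whatever the real code uses:

#include <thread>
#include <vector>

struct Image { /* pixel data, dimensions, ... */ };   // placeholder type
void compute_integral(Image &img);                    // placeholder per-image routine

// Four threads, each taking every fourth image; the only synchronization
// needed is the final join, since no two threads touch the same image.
void integrate_all(Image *images, int n_images, int n_threads = 4) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; t++)
        workers.emplace_back([=] {
            for (int i = t; i < n_images; i += n_threads)
                compute_integral(images[i]);
        });
    for (std::thread &th : workers) th.join();
}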

You can also split every image into four, but then you are penalized both by the need to synchronize the threads and by the fact that prefix sums are constrained by a data dependency. (You can split the image in four and compute separate integral images, but after this step you will need to add a constant to three of the quarter images.)

  • My situation is case 1, similar to a box filter. And you are right, the bottleneck is not the integral images part. I would consider splitting the image in four parts, but I think adding a constant is not enough. Thanks for your advice :-) – whenov Apr 16 '14 at 02:26
  • @whenov: processing whole images is much easier, and your images are too small to be split. –  Apr 16 '14 at 06:22