
Here is the relevant extract from the program in question. The matrix img[][] has size SIZE×SIZE and is initialized to:

img[j][i] = (2 * j + i) % 8196

Then you make a matrix res[][], where each element is set to the average of the 3×3 neighborhood (the 9 fields) around the corresponding element of the img matrix. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

That's all there is to the program. For completeness' sake, here is what comes before; no code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048. For example, here are the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also, knowing how to fix this would be nice, but if someone could just explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not the amount of memory used, only the execution time, so I don't see how that would help.

casperOne
  • Are you sure that the issue is caused by the code you posted? Where else are `i` and `j` used? – Dan Puzey Sep 04 '12 at 13:57
  • I don't understand how you can get faster results with 8193 instead of 8192. Did you try it several times? – bokan Sep 04 '12 at 14:01
  • @bokan it happens when the size is a multiple of the critical stride of the cache (a back-of-the-envelope sketch follows these comments). – Luchian Grigore Sep 04 '12 at 14:03
  • This code can't be right. It allocates 1GB on the stack. – Adam12 Sep 04 '12 at 14:07
  • @Mysticial, it doesn't matter, it exposes the same exact problem; the code can be different, but basically both questions ask about the same thing (and their titles are definitely similar). – Griwes Sep 04 '12 at 14:09
  • You should not process an image using a 2-dimensional array if you want high performance. Consider all pixels as being in a row and process them like a one-dimensional array. Do this blur in two passes. First add the values of the surrounding pixels using a sliding sum of 3 pixels: `slideSum += src[i+1] - src[i-2]; dest[i] = slideSum;`. Then do the same vertically and divide at the same time: `dest[i] = (src[i-width] + src[i] + src[i+width])/9`. http://www-personal.engin.umd.umich.edu/~jwvm/ece581/18_RankedF.pdf – bokan Sep 04 '12 at 14:25
  • There's actually two things going on here. It's not just super-alignment. – Mysticial Sep 04 '12 at 14:26
  • @Adam12 And how is that a problem? I've had more than 1GB on the stack from time to time. On a modern machine (32 bits or more), it shouldn't be a problem. – James Kanze Sep 04 '12 at 15:24
  • @bokan I get a 403 trying to access that PDF. – Andrew Marshall Sep 04 '12 at 19:36
  • @Adam12: Since the code snippet isn't a complete, compilable example (there's no `main` function), I'd assume that the data is global, not on the stack, without further information (and it's 500 MB, not 1 GB, though that's still quite a bit larger than the default stack size on most systems). – Adam Rosenfield Sep 04 '12 at 22:21
  • @bokan: While that's true in general, especially for dynamically allocated arrays, it won't make any difference in this case. An `N,M` 2D array is laid out in memory the same way as a 1D array of `N*M` elements, so it's the same whether you write `a[i][j]` or `a[i*stride+j]`. Dynamically allocated arrays can only have one dimension be dynamic, so if you don't know the width and height until runtime, it's much better to use a flat 1D array instead of an array of pointers to arrays. – Adam Rosenfield Sep 04 '12 at 22:25
  • @AdamRosenfield The 1D array is the optimization I propose in the second answer (which is not a direct answer to the question). – bokan Sep 04 '12 at 22:46
  • (Just a minor nitpick on your answer. For the first code segment, it would be nice if all your for loops had braces.) – Trevor Boyd Smith Sep 05 '12 at 16:35
  • @bokan, a sliding sum is applicable to both 1D and 2D arrays. I don't think there's any appreciable speedup to 1D vs. 2D access, unless the column and row are reversed as in this example. – Mark Ransom Sep 06 '12 at 02:55
  • @MarkRansom Just check the second answer to this question. A sliding sum is interesting if you average more than 3 values. With 3 values it uses 2 adds, so there's no benefit. – bokan Sep 06 '12 at 02:59
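
To make the critical-stride comment above concrete, here is a back-of-the-envelope sketch. The cache parameters are assumed, typical values for CPUs of that era, not measurements from the asker's machine:

#include <stdio.h>

int main(void) {
    /* Assumed, typical L1 data-cache parameters (not from the question). */
    const int cache_bytes   = 32 * 1024;  /* 32 KB L1 data cache   */
    const int associativity = 8;          /* 8-way set associative */

    /* Addresses that differ by a multiple of the critical stride
       compete for the same set of cache ways. */
    const int critical_stride = cache_bytes / associativity;  /* 4096 bytes */

    /* With SIZE = 8192, one row of floats is 8192 * 4 = 32768 bytes, an
       exact multiple of the critical stride, so walking down a column
       hits the same cache set on every step and evicts constantly.
       With SIZE = 8191 or 8193 the rows are staggered across sets. */
    const int row_bytes = 8192 * (int)sizeof(float);
    printf("critical stride = %d bytes, row = %d bytes (%s)\n",
           critical_stride, row_bytes,
           row_bytes % critical_stride == 0 ? "same set every row"
                                            : "staggered");
    return 0;
}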

2 Answers


The difference is caused by the same super-alignment issue that comes up in related questions about matrices whose dimensions are large powers of two.

But super-alignment only bites here because there's one other problem with the code.

Starting from the original loop:

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer loops that we're interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.


To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely, so you no longer get random slowdowns on large powers of two.
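
To see the access-pattern effect in isolation, here is a minimal benchmark sketch; the array size and timing harness are illustrative, not the code timed below:

#include <stdio.h>
#include <time.h>

#define N 8192
static float a[N][N];  /* static so the 256 MB array is not on the stack */

int main(void) {
    volatile float sink = 0;  /* volatile so the loops aren't optimized away */
    clock_t t0;

    /* Row-wise walk: consecutive accesses are adjacent in memory. */
    t0 = clock();
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            sink += a[r][c];
    printf("row-wise:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-wise walk: consecutive accesses are N floats apart,
       which is what the original outer loops were doing. */
    t0 = clock();
    for (int c = 0; c < N; c++)
        for (int r = 0; r < N; r++)
            sink += a[r][c];
    printf("column-wise: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}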


Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged Outer-Loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds
Mysticial
  • Now that I look at the code again, it looks like an image blurring operation. It's averaging the intensity with adjacent cells. – Mysticial Sep 04 '12 at 14:49
  • I'll also note that unrolling the inner loops has no effect on performance. The compiler probably does it automatically. I unrolled them for the sole purpose of getting rid of them to make it easier to spot the problem with the outer loops. – Mysticial Sep 04 '12 at 16:54
  • Even though this is the Right Answer something isn't sitting well with me... Shouldn't the outer/first index be 'i' and the inner one be 'j'? This whole problem could have been solved by just switching the indices in the original problem since the OP was looping over 'i' first. – Mark Canlas Sep 04 '12 at 17:04
  • Also, instead of unrolling the loop you should probably just comment it out. Despite your note, it still looks like the unrolling is a significant part of the fix, visually. – Mark Canlas Sep 04 '12 at 17:05
  • @MarkCanlas Yeah, I agree. `i` should go before `j`. But while I was testing it, it was easier for me to just drag/drop the for-loop instead of changing all the indices. As for the loop unrolling: I tested it and it made no difference in VS2010. But eliminating it will probably make the compiler optimize out *all* the code. – Mysticial Sep 04 '12 at 17:08
  • And you can speed this code up by another factor of three by caching the sums along each row. But that and other optimizations are outside the scope of the original question. – Eric Postpischil Sep 04 '12 at 18:27
  • If the 2nd loop is more effective than the first, it makes me want to not use C++ again – Ali Sep 04 '12 at 19:43
  • @ClickUpvote This is actually a hardware (caching) issue. It has nothing to do with the language. If you tried it in any other language that compiles or JITs to native code, you would probably see the same effects. – Mysticial Sep 04 '12 at 20:02
  • @MarkCanlas: I doubt the factor of three; with caching you need more memory, and swapping/cache misses might occur, so I think it needs testing. – ted Sep 04 '12 at 21:05
  • @Mysticial I'm happy with my bytecode JVM if it means I don't have to write ugly code like that second loop – Ali Sep 04 '12 at 22:40
  • @ClickUpvote: You seem rather misguided. That "second loop" was just Mystical unrolling the inner loops by hand. This is something your compiler will almost certainly do anyway, and Mystical only did it to make the issue with the outer loops more obvious. It is by no means something you should bother doing yourself. – Lily Ballard Sep 04 '12 at 23:44
  • I actually didn't notice the problem with the outer loop until *after* I unrolled the inner loops. It wasn't obvious at all through 4 nesting levels. Which is funny because loop-unrolling usually makes code more *unreadable*. But in this case, it let me see the problem. – Mysticial Sep 05 '12 at 00:07
  • To be fair, @ClickUpvote might benefit from a higher-level language than C++ if it optimizes away problems like this for you, for example if it were functional and you just specified what you wanted done to a matrix and let the compiler sort out the details of how the looping occurs (and probably parallelization). So the cause is hardware, but he may very well benefit from software. – Chris Moschini Sep 05 '12 at 00:12
  • @ted: The factor of three is not speculative; I timed it. You do not cache an entire copy of the array, just the sums of the most recent two rows. – Eric Postpischil Sep 05 '12 at 02:39
  • THIS is a perfect example of a good answer on SO: References similar questions, explains step-by-step how you approached it, explains the problem, explains how to FIX the problem, has great formatting, and even an example of the code running on your machine. Thank you for your contribution. – MattSayar Sep 05 '12 at 03:11
  • @EricPostpischil: Interesting, and is there no size where the data without those two last rows fills up the cache, while without caching it still fits? (We are not considering alignment issues anymore, but cache-size issues.) I still find the factor of three impressive, since you also have looping and a division in addition to the three addition blocks, from which you can remove two for two additional additions. If I am not mistaken, you still need 5 additions out of 9 (3 for the non-cached row + 2 additions from the cache), plus an additional write to the caching rows. – ted Sep 05 '12 at 05:41
  • @EricPostpischil: Besides, how do you index the cached rows? You can't do `previousRow[i]` and `rowBeforePreviousRow[i]`, since that would require moving the contents of `previousRow[i]` to `rowBeforePreviousRow[i]` (unless you use modulo indexing?). I would be really interested to see the modified code if you could post it, please. – ted Sep 05 '12 at 05:45
  • @ted Check my answer beside this one. It does not address the cache question, but performance. – bokan Sep 05 '12 at 12:29
  • @EricPostpischil, that technique applied in two dimensions in a very narrow field was sufficient to get the first of my two patents: [4,811,414](http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/4811414). This was back in the days when 25 MHz was a fast processor. – Mark Ransom Sep 06 '12 at 02:34
  • I implemented a similar test in Java and the time increase is also noticeable: 1.3 s when iterating by columns, 0.3 s when iterating by rows. Problem size was 4096 elements, and the computer was a 2-year-old laptop. – Mister Smith Sep 06 '12 at 08:29

The following tests were done with the Visual C++ compiler as it is used by the default Qt Creator install (I guess with no optimization flag). When using GCC, there is no big difference between Mystical's version and my "optimized" code. So the conclusion is that compiler optimizations take care of micro-optimization better than humans (me at least). I leave the rest of my answer for reference.


It's not efficient to process images this way; it's better to use one-dimensional arrays. Processing all pixels is then done in one loop. Random access to a point can be done using:

pointer + (x + y*width)*(sizeOfOnePixel)
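
As a minimal sketch of that addressing scheme (the names pixels, width, x and y are illustrative, not from any particular library):

/* Read the pixel at (x, y) from a flat 1D buffer of floats.
   Row y starts at offset y * width. */
static inline float get_pixel(const float *pixels, int width, int x, int y) {
    return pixels[x + y * width];
}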

In this particular case, it's better to compute and cache the horizontal sums of three-pixel groups, because each one is used three times.

I've done some tests and I think it's worth sharing. Each result is an average of five tests.

Original code by user1615209:

8193: 4392 ms
8192: 9570 ms

Mystical's version:

8193: 2393 ms
8192: 2190 ms

Two passes using a 1D array: a first pass for the horizontal sums, a second for the vertical sum and average (the assumed first pass is sketched after the timings below). The second pass addresses with three pointers and only increments, like this:

imgPointer1 = &avg1[0][0];          // horizontal sums: row above the pixel
imgPointer2 = &avg1[0][SIZE];       // horizontal sums: the pixel's own row
imgPointer3 = &avg1[0][SIZE+SIZE];  // horizontal sums: row below the pixel

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}

8193: 938 ms
8192: 974 ms

Two passes using a 1D array and plain indexed addressing like this:

// hsumPointer[] holds the horizontal 3-pixel sums from the first pass
for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}

8193: 932 ms
8192: 925 ms
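
Both two-pass variants above show only the second pass. Here is a minimal sketch of the assumed first pass, reusing the pointer names from the fragments; borders are ignored for simplicity, as elsewhere in this answer:

/* First pass: horizontal 3-pixel sums over the flat image buffer, so that
   hsumPointer[i] = imgPointer[i-1] + imgPointer[i] + imgPointer[i+1]. */
for(i=1;i<totalSize-1;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}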

One pass caching horizontal sums just one row ahead so they stay in cache:

// Horizontal sums for the first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i<totalSize;i++){
    // Compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // Final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}

8193: 599 ms
8192: 652 ms

Conclusion:

  • No benefit from using several pointers and just increments (I thought it would have been faster)
  • Caching horizontal sums is better than computing them several times.
  • Two passes is not three times faster, only two times.
  • It's possible to get 3.6 times faster by using both a single pass and caching an intermediate result

I'm sure it's possible to do much better.
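
One direction, suggested in the comments: for blur windows wider than 3 pixels, a sliding (running) sum makes the per-pixel cost independent of the window width. A minimal sketch with assumed names (src, hsum and the half-width R are illustrative):

/* Sliding horizontal sum of width 2*R + 1 over one row of length 'width'.
   Each step adds the pixel entering the window and drops the one leaving.
   Borders are ignored for simplicity. */
void sliding_row_sum(const float *src, float *hsum, int width, int R) {
    float s = 0;
    for (int x = 0; x <= 2 * R; x++)       /* prime the first full window */
        s += src[x];
    hsum[R] = s;                           /* window centered at x = R */
    for (int x = R + 1; x < width - R; x++) {
        s += src[x + R] - src[x - R - 1];  /* slide right by one pixel */
        hsum[x] = s;
    }
}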

NOTE: Please note that I wrote this answer targeting general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo code; I was asked in the comments to do tests... Here is a completely refactored version with tests.

bokan
  • "I think it's at least 3 times faster"—care to back up that claim with some metrics or citations? – Adam Rosenfield Sep 05 '12 at 04:27
  • @AdamRosenfield "I think" = supposition != "It is" = claim. I have no metric for this and I would like to see a test. But mine requires 7 increments, 2 subs, 2 adds and one div per pixel, with each loop using fewer local variables than there are registers in the CPU. The other requires 7 increments, 6 decrements, 1 div and between 10 and 20 muls for addressing, depending on compiler optimization. Also, each instruction in the loop requires the result of the previous instruction; this discards the benefits of the super-scalar architecture of Pentiums. So it has to be faster. – bokan Sep 05 '12 at 09:39
  • The answer to the original question is all about memory and cache effects. The reason that the OP's code is so slow is that its memory access pattern goes by columns instead of by rows, which has very poor cache locality of reference. It's *particularly* bad at 8192 because then consecutive rows end up using the same cache lines in a direct-mapped cache or cache with low associativity, so the cache miss rate is even higher. Interchanging the loops provides an enormous performance boost by greatly increasing cache locality. – Adam Rosenfield Sep 05 '12 at 19:42
  • So while you might be able to squeeze a little more performance by counting instructions and micro-optimizing like you have, the big, big performance gains come from making a single pass through the data in row order to maximize cache locality (which you've also done). I believe a 3x gain (or more) over the original code due to the loop interchange, but definitely not a 3x gain over Mystical's answer. – Adam Rosenfield Sep 05 '12 at 19:44
  • Also, there are some CPUs which have special instructions for array addressing, so this performance increase decays further. And on the other hand there might be compiler optimizations, such that the effects of your optimizations are small compared to Mystical's answer. – ted Sep 05 '12 at 20:40
  • @AdamRosenfield I've done the tests and added them. I also added an optimization using both horizontal-sum caching and a single pass. – bokan Sep 06 '12 at 02:11
  • Well done, those are some impressive numbers. As you found, it's all about memory performance -- using several pointers with increments didn't provide any benefit. – Adam Rosenfield Sep 06 '12 at 15:59
  • @AdamRosenfield I was quite worried this morning because I could not reproduce the tests. It appears that the performance increase is only with the Visual C++ compiler. Using GCC, there's only a small difference. – bokan Sep 06 '12 at 19:50