
I have a piece of code that walks over a really big image, and two versions that look almost identical have a 70% speed difference. The first one is the fast one and takes around 10 s:

if (clusterPntr[col] == i) {
    /* Calculate the location of the relevant pixel (rows are flipped) */
    pixel = bmp->Data + ((bmp->Header.Height - row - 1) * bytes_per_row + col * bytes_per_pixel);
    /* Get pixel's RGB values */
    b = pixel[0];
    g = pixel[1];
    r = pixel[2];
    totr += r;
    totg += g;
    totb += b;
    sizeCluster++;
}

The second one takes 17 s:

if (clusterPntr[col] == i) {
    /* Calculate the location of the relevant pixel (rows are flipped) */
    pixel = bmp->Data + ((bmp->Header.Height - row - 1) * bytes_per_row + col * bytes_per_pixel);
    /* Get pixel's RGB values */

    // why is this SO MUCH SLOWER?
    totr += pixel[2];
    totg += pixel[1];
    totb += pixel[0];
    sizeCluster++;
}

I would think the problem lies in caching, and that one version perhaps keeps the values in registers while the other keeps going back to the data array. The CPU is an M1 Pro, so the ARM architecture might have something to do with it as well.
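For context, the code sits in a loop nest roughly like the sketch below (heavily trimmed: the struct fields and names are the ones from the snippets above, but the loop bounds, types and cluster bookkeeping are simplified and not the exact code):

/* Sketch only: accumulate the colour of every pixel assigned to cluster i. */
unsigned long totr = 0, totg = 0, totb = 0, sizeCluster = 0;
unsigned char *pixel, r, g, b;

for (int row = 0; row < bmp->Header.Height; row++) {
    for (int col = 0; col < bmp->Header.Width; col++) {   /* Width is assumed */
        if (clusterPntr[col] == i) {                       /* cluster label for this pixel, as in the snippets */
            /* rows are stored bottom-up in a BMP */
            pixel = bmp->Data + ((bmp->Header.Height - row - 1) * bytes_per_row
                                 + col * bytes_per_pixel);
            b = pixel[0];
            g = pixel[1];
            r = pixel[2];
            totr += r;
            totg += g;
            totb += b;
            sizeCluster++;
        }
    }
}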

  • You are accessing the channels in a different order, which affects your memory access pattern. Try to access the channels in the same order. – wohlstad Jan 01 '23 at 17:54
  • Could you expand the example so that it becomes compilable? Then we can look at what weird things the compiler may have done. Maybe the backwards order made auto-vectorization fail, that sort of thing. – harold Jan 01 '23 at 17:54
  • It could be a pointer aliasing issue between `pixel` and `tot*` (which would hinder compiler optimizations), but it's hard to tell without the full code. – Iwa Jan 01 '23 at 17:58
  • Edit the question to provide a [mre]. Include the compiler version and the commands used to build. – Eric Postpischil Jan 01 '23 at 17:58
  • Did you enable optimizations? – 0___________ Jan 01 '23 at 19:04
  • May I suggest looking at the assembly generated by the compiler? Use https://godbolt.org/ – Christoph Rackwitz Jan 02 '23 at 14:06

1 Answer


The only thing that really changes between your two versions is that the slow one accesses index 2, then index 1, then index 0, as opposed to index 0, 1 and 2.

One explanation (which, as some comments suggest, could be inaccurate):

On average, the backward order (2, 1, 0) causes 3 cache misses per pixel, while the forward order (0, 1, 2) causes only one.

When loading pixel[0], it's very likely that pixel[1] and pixel[2] are pulled into the cache as well, making the subsequent accesses much faster.

Another explanation comes from this answer to "C for loop indexing: is forward-indexing faster in new CPUs?":

I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.

Anyway, it's very likely that the performance difference comes from the access order. Since your algorithm doesn't impose a read order, I would try the code below:

totb += pixel[0];
totg += pixel[1];
totr += pixel[2]; 

And a micro-optimization is easy here if `pixel` is not a constant pointer:

totb += (*pixel++);
totg += (*pixel++);
totr += (*pixel++);
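
Putting the two together, the inner loop could look roughly like the sketch below. This is only an illustration: the `Width` bound and the local accumulators are assumptions, not taken from your code. Keeping the sums in plain locals also sidesteps the possible aliasing between `pixel` and the totals mentioned in the comments.

/* Sketch: forward channel order, pointer increments, and local
   accumulators so the compiler doesn't have to assume `pixel`
   might alias the running totals. */
unsigned long sumB = 0, sumG = 0, sumR = 0;

for (col = 0; col < bmp->Header.Width; col++) {   /* Width is assumed */
    if (clusterPntr[col] == i) {
        unsigned char *p = bmp->Data
            + (bmp->Header.Height - row - 1) * bytes_per_row
            + col * bytes_per_pixel;
        sumB += *p++;
        sumG += *p++;
        sumR += *p++;
        sizeCluster++;
    }
}

totb += sumB;
totg += sumG;
totr += sumR;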
– Jean-François Fabre