1

I have a struct

struct A
{
int v[10000000];
};

If I have `A a[2];` and wish to calculate the total sum of the values, which of these two methods is the fastest?

int method_1(const A a[], int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
        for (int j = 0; j < 10000000; j++)
            total += a[i].v[j];

    return total;
}


int method_2(const A a[], int length)
{
    int total = 0;
    for (int j = 0; j < 10000000; j++)
        for (int i = 0; i < length; i++)
            total += a[i].v[j];

    return total;
}

a[2] is declared as two consecutive blocks of struct A, like so:

----a[0]---- /--- a[1]----

[][][][][][][][]/[][][][][][][][]

So I might be tempted to say that method_1 is faster, based on the intuition that the blocks are consecutive and the iteration through each block's v is also consecutive.

What I am really interested in is how the memory is actually accessed and what the most efficient way to access it is.

EDIT

I have changed the size of v from 32 to 10000000, because apparently it wasn't understood that I was referring to the general case.

Alex
    Benchmark! Call each function a million times, and time the loops. Both with and without optimizations. You can also look at the generated code do see if there are many differences. – Some programmer dude Nov 26 '13 at 11:53
  • `method_1` is more intuitive, and is easier to read/understand. I would stick with that... For such small things, I think any real differences will be *statistically insignificant*, but then, that's only a guess... do as @JoachimPileborg says... – Nim Nov 26 '13 at 12:00
  • Apart from the code not being compilable (missing type), method_1 will be faster. It's way more cache friendly. – stefan Nov 26 '13 at 12:09
  • @JoachimPileborg I can't base my work on pure benchmark testing. This is easy for simple cases, but for complex scenarios it's much harder. I was interested in whether there are prefetchers, or other under-the-hood C++ mechanisms, that I might take advantage of. – Alex Nov 26 '13 at 12:10
  • [possible duplicate](http://stackoverflow.com/a/13093193/2295679) – Cahu Nov 26 '13 at 12:10
  • @stefan I wrote the code directly here and missed that; it was specified. – Alex Nov 26 '13 at 12:13

4 Answers

2

Each time a piece of memory is read, a whole cache line is loaded from main memory into the CPU cache; today a cache line is typically 32 or 64 bytes long. This is largely why reading consecutive memory blocks is fast.

Now, there is more than one cache line...

In your case, both methods may have similar performance: the two arrays will most probably not collide into the same cache line, so both can be resident in the cache on different lines at the same time.

One related thing you might consider to improve performance is to avoid the [] operators in favor of iterating with pointers, "iterator"-style, like this:

int method_1(const A a[], int length)
{
    int total = 0;
    for (const A* aIt = a; aIt < a + length; ++aIt)
        for (const int* vIt = aIt->v; vIt < aIt->v + 10000000; ++vIt)
            total += *vIt;

    return total;
}

This way you avoid the double [], which implies a multiplication by the sizeof of an array element (the compiler may optimize that away, but if it doesn't, it will be costly when executed millions of times). Your compiler may well be smart enough to transform the original code into what I've shown, using only additions, but it may not be, and I've seen this make a big difference when the per-element operation is as trivial as an increment. You're best off measuring and seeing how these options work out in your environment.

RnR
  • I am a bit unfamiliar with cache handling by the compiler. From what I've read, when memory fragments are read they are stored in sets and lines, but only one set is active at any given time. How is the number of lines determined, what does a line actually contain (i.e. from 'where' to 'where'), and why are the sets split into lines? – Alex Nov 27 '13 at 10:08
  • @BadescuAlexandru The compiler doesn't handle the cache. However it knows how caches work and optimize code to be cache friendly when possible. For the sets and lines stuff, you might want to read about [cache associativity](https://en.wikipedia.org/wiki/Cache_associativity#Associativity). – Cahu Nov 27 '13 at 20:38
1

Accessing elements in the order they appear in memory will improve performance in most cases, since it allows the prefetcher to load data before you even use it. Besides, if you access data in a non-contiguous way, you might load and discard the same cache line many times, and this has a cost.

Cahu
0

The data size is small enough to fit completely in the cache on modern CPUs. I'm not sure about the compiler vectorizing this code, though.

Andriy Tylychko
  • the data here is purely for explanation purposes. v could hold millions and 'a' may have a length of thousands. – Alex Nov 26 '13 at 12:01
  • @BadescuAlexandru: but you got the idea, didn't you? For much larger data the CPU cache can be really important for performance – Andriy Tylychko Nov 26 '13 at 12:12
0

I don't think method_2 is slower than method_1. The chunk of memory will be brought into the CPU cache, and then accessing a[0] and a[1] will both take the same time.

To be on the safe side, method_1 can always be considered better than method_2.

Vishal R