
Yes, this is a homework question; I just need a push in the right direction.

Which of these two C++ code blocks is faster, and why? I think it is the top one because the array is being accessed in order along i, or am I wrong here?

    double A[100][100];
    ...
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            A[i][j] = i * j;
        }
    }


    double A[100][100];
    ...
    for (int j = 0; j < 100; j++) {
        for (int i = 0; i < 100; i++) {
            A[i][j] = i * j;
        }
    }
user1789951

2 Answers


There is no way to know which piece of code is faster without running and profiling your code.

We could guess about how locality and cache behaviour would influence that timing (and your guess is a good one), but guesses aren't a substitute for profiling. (See: How can I profile C++ code running in Linux?)

One reason why the first version could be faster:

  • Accessing the elements of the array in the order that they are laid out in memory could allow the cache to take advantage of that. (See: What is “cache-friendly” code?)

Why there may be no difference:

  • The entire array of 10,000 elements could fit in cache, rendering the aforementioned advantage moot.

I can't think of any reason why the second would be faster, but I've been surprised before.
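
As a quick illustration of the layout point above (this sketch is mine, not part of the original answer): a C++ 2-D array is stored row by row, so stepping j moves one element at a time while stepping i jumps a whole row:

    #include <cstdio>
    #include <cstddef>

    int main() {
        static double A[100][100];

        // In row-major storage, consecutive j indices are adjacent in memory,
        // while consecutive i indices are a full row (100 doubles) apart.
        std::ptrdiff_t step_j = &A[0][1] - &A[0][0];   // 1 element
        std::ptrdiff_t step_i = &A[1][0] - &A[0][0];   // 100 elements

        std::printf("stride along j: %td element(s)\n", step_j);
        std::printf("stride along i: %td element(s)\n", step_i);
    }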

  • How can a prof ask, word for word, "which is faster?" and expect an answer? I truly do not know how to even start answering the question. – user1789951 Oct 19 '13 at 00:50
  • @user1789951 It is a sad fact of life that programming professors are rarely good programmers, and often teach things which are bad practice, out of date, and sometimes even wrong. – JBentley Oct 19 '13 at 00:51
  • Yeah, I suppose; doesn't really help me though, haha. – user1789951 Oct 19 '13 at 00:56
  • Plus, some compilers understand this now and will do the loop interchange for you. – Billy ONeal Oct 19 '13 at 01:17
  • @BillyONeal Thanks, you answered my question in the comments in Crashworks' answer. – JBentley Oct 19 '13 at 01:18
  • This answer seems to be written from the perspective of reading memory ("accessing the elements of the array..."), yet the question is about writing to memory, not reading. How does this answer change when taking that into account? (i.e. cache coherency, CAS latencies when writing the same/different banks, etc) – camh Oct 19 '13 at 04:38
  • @Sancho (To be fair, when I say "some compilers" I mean MSVC++ 2013 -- though if I'm not mistaken this optimization has been in ICC for longer) – Billy ONeal Oct 19 '13 at 05:43

The most general answer is: you need to profile both blocks and see the result empirically.

However, I can give you an answer for the majority of modern x86, x64, PPC, and ARM processors with hierarchical caches. On these platforms the top one will be faster because of better data locality: it accesses the memory addresses sequentially, so you will hit the data cache more often. Smart x86 and x64 implementations will even notice that you are reading memory sequentially in this fashion and prefetch the next cache line before you need it. The bottom pattern accesses memory non-sequentially across distant addresses, meaning that you are likely to miss cache on every read.
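
(Not part of the original answer.) If you do want to measure this yourself, a rough micro-benchmark sketch using std::chrono could look like the following; the helper names are mine, and a proper profiler is still the better tool:

    #include <chrono>
    #include <cstdio>

    double A[100][100];   // external linkage, so the writes are not trivially optimized away

    // Row-major traversal: j varies fastest, matching the memory layout.
    void fill_row_major() {
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 100; j++)
                A[i][j] = i * j;
    }

    // Column-major traversal: i varies fastest, striding a whole row per write.
    void fill_col_major() {
        for (int j = 0; j < 100; j++)
            for (int i = 0; i < 100; i++)
                A[i][j] = i * j;
    }

    template <typename F>
    long long time_ns(F f, int reps = 10000) {
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; r++) f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        std::printf("row-major:    %lld ns\n", time_ns(fill_row_major));
        std::printf("column-major: %lld ns\n", time_ns(fill_col_major));
        std::printf("check: %f\n", A[99][99]);   // keep the array observable
    }

With only a 100×100 array the whole thing is about 78 KB, so much of it may sit in L2 either way and the measured gap can be modest; the effect grows with larger arrays.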

Ulrich Drepper has a nice paper about this, What Every Programmer Should Know About Memory. One of his examples in that paper demonstrates exactly how those two blocks of code differ.

As an example of the math here, assume arguendo that you are programming an Intel Core i7 with a 64-byte cache line size and a 32 KB L1 data cache. That means that every time you fetch an address, the processor will also fetch all the other data in that 64-byte-aligned block. On that platform a double is eight bytes, so eight of them fit per cache line. Thus the top example will, on average, miss on one out of eight iterations: after each miss, the next 56 bytes are fetched as well, so the next seven double reads will be in cache.
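
The arithmetic in that paragraph can be written out directly (the cache parameters below are the ones assumed above, not queried from the hardware):

    #include <cstdio>

    int main() {
        constexpr int cache_line_bytes = 64;                // assumed Core i7 line size
        constexpr int l1_data_bytes    = 32 * 1024;         // assumed 32 KB L1 data cache
        constexpr int double_bytes     = sizeof(double);    // 8 bytes

        constexpr int doubles_per_line = cache_line_bytes / double_bytes;   // 8
        constexpr int array_bytes      = 100 * 100 * double_bytes;          // 80,000

        std::printf("doubles per cache line: %d\n", doubles_per_line);
        std::printf("row-major miss rate:    1 in %d accesses\n", doubles_per_line);
        std::printf("whole array:            %d bytes (vs. %d-byte L1)\n",
                    array_bytes, l1_data_bytes);
    }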

The bottom example could probably fit 100 lines of data (one for each i) into cache simultaneously: 100 * 64 = 6400 bytes, well within the cache size. But it's also likely that you'll exceed the cache's associativity, so that two of those lines map to the same set in the L1 and one evicts the other.

Crashworks
  • Great answer. Any thoughts on compiler optimizations which could make the two equivalent? – JBentley Oct 19 '13 at 01:13
  • Yes, great answer. @JBentley If this is the only code in main, the compiler could simply optimize the loops away to nothing, I suppose. –  Oct 19 '13 at 01:15
  • @sancho Well, yes! But assuming the array is later used in an observable way, could the compiler realise that reordering the element accesses doesn't interfere with the logic of the block, and therefore do so, in a cache friendly way? – JBentley Oct 19 '13 at 01:17
  • @JBentley If the compiler can be sure that nothing aliases those addresses (i.e. via `restrict` and having no other pointer accesses in that loop), and it's aware of the cache characteristics, and it's smarter than the version of GCC I have installed, it could turn the bottom example into the top one. – Crashworks Oct 19 '13 at 01:18
  • You are addressing reading from memory, but the example in the question is about writing to memory. See also my comment on @Sancho's answer. – camh Oct 19 '13 at 04:40
  • @camh All those platforms have write-back caches: to write to an address, it must first be fetched into cache. Some chips have special opcodes to write straight through to memory without polluting the cache, but generally they are only emitted when one explicitly uses a compiler intrinsic. I've never seen a compiler that automatically chooses a non-cached write opcode for an ordinary assignment like this one. – Crashworks Oct 19 '13 at 05:12
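
(Aside, not from the original comment thread.) The "special opcodes" mentioned here are exposed as compiler intrinsics; on x86/x64 with SSE2, a streaming (non-temporal) store that bypasses the cache looks roughly like this sketch:

    #include <emmintrin.h>   // SSE2 intrinsics, x86/x64 only

    alignas(16) double A[100][100];

    // Fill the array with non-temporal stores. Each _mm_stream_pd writes two
    // doubles toward memory without pulling the cache line in first. This is
    // rarely a win for data you are about to read again.
    void fill_streaming() {
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j += 2) {
                __m128d v = _mm_set_pd(double(i * (j + 1)),   // high half -> A[i][j+1]
                                       double(i * j));        // low half  -> A[i][j]
                _mm_stream_pd(&A[i][j], v);
            }
        }
        _mm_sfence();   // order the streaming stores before later reads
    }

Note that _mm_stream_pd requires 16-byte-aligned addresses, which is why the array is declared with alignas(16).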