#include <stdio.h>
#include <time.h>

#define N  32768

char a[N][N];
char b[N][N];

int main() {
    int i, j;

    printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
    printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);

    clock_t start = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = b[i][j];
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    start = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j];
    end = clock();
    seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    return 0;
}

Output:

address of a[32768][32768] = 0x80609080
address of b[    0][    0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs

Why does column by column copying take almost 6 times as long as row by row copying? I understand that a 2D array is basically an n×n linear array where A[i][j] = A[i*n + j], but using simple algebra, I calculated that a Turing machine head (moving over main memory) would have to travel a total distance of roughly 2·n²·(n² + x) in both cases. Here n×n is the size of each array and x is the distance between the last element of the first array and the first element of the second array.


3 Answers


It pretty much comes down to this image (source):

[figure: the memory hierarchy, from small, fast caches near the CPU down to large, slow main memory]

When accessing data, your CPU does not load just a single value; it also pulls the adjacent data into its L1 cache. When you iterate through the array by row, the items that were automatically loaded into the cache are exactly the ones processed next. When you iterate by column, however, an entire "cache line" of data (the size varies per CPU) is loaded each time, a single item of it is used, and then the next line has to be loaded, effectively making the cache pointless.
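For reference, on Linux with glibc you can ask the system what the line size actually is. This is just a sketch: the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension and may report 0 where the value is unknown.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: L1 data cache line size in bytes (0 if unknown) */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 dcache line size: %ld bytes\n", line);
    /* with 64-byte lines and 1-byte array elements, a row-wise pass uses
       all 64 bytes of each line loaded; a column-wise pass uses just 1 */
    return 0;
}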

The Wikipedia entry and, as a high-level overview, this PDF should help you understand how CPU caches work.

Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only a tiny fraction of your array fits into the L1 cache at any one time: when you walk down a column, the line loaded for each row has long been evicted by the time you come back for the next column. If your rows were much smaller (say, the total size of your two-dimensional array was only some kilobytes), the whole array could stay cached and you might not see a performance impact from iterating per-column. The sketch below puts numbers on that.
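A back-of-the-envelope sketch (the 32 KiB L1 and 64-byte lines are typical values, assumed rather than measured):

#include <stdio.h>

int main(void) {
    const long rows = 32768;          /* N in the question          */
    const long line = 64;             /* assumed cache line size    */
    const long l1   = 32 * 1024;      /* assumed L1 data cache size */

    /* a column-wise pass gets reuse only if one line per row stays resident */
    printf("working set: %ld KiB, L1: %ld KiB\n", rows * line / 1024, l1 / 1024);
    printf("rows whose lines fit at once: %ld of %ld\n", l1 / line, rows);
    return 0;
}

With these numbers the column-wise pass would need 2 MiB of residency to get any reuse, while only about 512 rows' worth of lines fit, so every line is evicted long before it is touched again.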

fresskoma
    This is the correct explanation, but you need to explain why the data is reloaded multiple times as the number of rows and columns is such that the data does not fit in L1 cache memory. – chqrlie Dec 15 '15 at 23:34
    When I change the arrays to char a[N][N+1]; char b[N][N+1]; column by column copying becomes faster (takes like 10 secs). Why's that? – Kakaji Dec 16 '15 at 06:29
    @user3711976, because of cache alignment issues - http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of – Leeor Dec 16 '15 at 07:07

While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: addresses run from 0 up to one less than the number of bytes available (on nearly all machines).

Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100, 101, 102...

In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages, such as FORTRAN, concatenate the columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.

If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N = 10000 bytes. Then the second column is 101, 10101, 20101, etc. This is the absolute worst case for most cache schemes.
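You can convince yourself of the layout directly; a minimal sketch (the assertion encodes the row-major rule a[i][j] == *(&a[0][0] + i*N + j) from the question):

#include <assert.h>
#include <stdio.h>

#define N 8

char a[N][N];

int main(void) {
    /* row-major: a[i][j] lives at byte offset i*N + j from &a[0][0] */
    assert(&a[3][5] == &a[0][0] + 3 * N + 5);

    printf("step along a row:   %td byte(s)\n", &a[0][1] - &a[0][0]); /* 1 */
    printf("step down a column: %td byte(s)\n", &a[1][0] - &a[0][0]); /* N */
    return 0;
}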

In the very worst case, you can cause a page fault on each access. These days it would take an enormous array to cause that, even on an average machine. But if it happened, each touch could cost ~10 ms for a head seek, versus a few nanoseconds per sequential access. That's a difference of over a factor of a million; computation effectively stops. It has a name: disk thrashing.

In the more normal case where only cache misses are involved, not page faults, you might still see a factor of a hundred. That's still worth paying attention to.

Gene

There are 3 main aspects that contribute to the timing difference:

  1. The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory, which is bad if you expect meaningful results (functionally as well as timing-wise), but in terms of timing what matters here is that these addresses are cold: they reside in main memory (if you're lucky), or aren't even paged in yet (if you're less lucky). In the latter case, you take a page fault on each new page and trap into the kernel, which has to allocate and zero the page on first touch. Note that this has nothing to do with the order of traversal; the first access is simply much slower. To avoid it, initialize both arrays to some value before timing.

  2. Cache line locality (as explained in the other answers): if you access data sequentially, you miss once per line and then enjoy having the rest of it fetched already. (You most likely won't even hit in the cache proper but in a fill buffer, since the consecutive requests arrive while the line is still in flight.) When accessing column-wise, you fetch the line and cache it, but if the reuse distance is large enough, you lose it before coming back and have to fetch it again.

  3. Prefetching: modern CPUs have hardware prefetchers that can detect sequential accesses and fetch the data ahead of time, eliminating even the first miss of each line. Most CPUs also have stride-based prefetchers that could in principle cover the column stride, but these usually break down on matrix traversals like this one: with so many columns, there are too many concurrent stride streams for the hardware to track simultaneously. (A manual software-prefetch hint is sketched right after this list.)
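For illustration, here is what a manual hint looks like; a sketch using the GCC/Clang builtin __builtin_prefetch, with an arbitrary lookahead of 8 rows. Whether it helps at all is very machine-dependent.

#define N 32768

/* copy one column, asking the cache for the source element a few rows ahead */
void copy_column(char a[N][N], char b[N][N], int j) {
    enum { AHEAD = 8 };           /* lookahead distance; a guess, tune per machine */
    for (int i = 0; i < N; i++) {
        if (i + AHEAD < N)        /* args: address, rw=0 (read), locality=0 (no reuse) */
            __builtin_prefetch(&b[i + AHEAD][j], 0, 0);
        a[i][j] = b[i][j];
    }
}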

As a side note, I would recommend performing any timing measurement multiple times and amortizing the results; that alone would eliminate problem #1. A reworked benchmark along these lines is sketched below.
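A minimal reworking of the question's benchmark (a sketch: memset initializes and pages in both arrays up front, and REPS = 3 is an arbitrary choice):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N    32768
#define REPS 3

char a[N][N];
char b[N][N];

int main(void) {
    /* touch every page before timing so first-access page faults
       don't pollute the measurement (issue #1 above) */
    memset(a, 1, sizeof a);
    memset(b, 2, sizeof b);

    double col = 0.0, row = 0.0;
    for (int r = 0; r < REPS; r++) {
        clock_t t0 = clock();
        for (int j = 0; j < N; j++)       /* column by column */
            for (int i = 0; i < N; i++)
                a[i][j] = b[i][j];
        clock_t t1 = clock();
        for (int i = 0; i < N; i++)       /* row by row */
            for (int j = 0; j < N; j++)
                a[i][j] = b[i][j];
        clock_t t2 = clock();
        col += (double)(t1 - t0) / CLOCKS_PER_SEC;
        row += (double)(t2 - t1) / CLOCKS_PER_SEC;
    }
    printf("column by column: %f secs/run\n", col / REPS);
    printf("row by row:       %f secs/run\n", row / REPS);
    return 0;
}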

Leeor