
I am experimenting with row- vs column-major traversal. The general idea is that row-major traversal should be faster thanks to caching, vectorization, and so on. Here is my test code:

// gcc -o main.out main.c -O3
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // for memset()
#include <time.h>

int main() {
  size_t dim[] = {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000};
  struct timespec ts;
  uint32_t* arr_ptr;
  double delta, t0;
  printf(
    "Dim\tArraySize(KB)\tRow-Major Time\tRM Sample\tCol-Major Time\tCM Sample\n"
  );
  for (size_t i = 0; i < sizeof(dim) / sizeof(dim[0]); ++i) {
    size_t d = dim[i];
    printf("%5lu,\t%11lu,\t", d, d * d * sizeof(uint32_t) / 1024);
    arr_ptr = (uint32_t*)malloc(d * d * sizeof(uint32_t));
    memset(arr_ptr, 0, d * d * sizeof(uint32_t));
    timespec_get(&ts, TIME_UTC);
    t0 = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0;
    // row-major: the inner index k walks contiguous memory
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + j * d + k) += (j + k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    delta = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0 - t0;
    printf("%0.9lf,\t%8u,\t", delta, *(arr_ptr + ts.tv_sec % d * d + ts.tv_nsec % d));
    free(arr_ptr);

    arr_ptr = (uint32_t*)malloc(d * d * sizeof(uint32_t));
    memset(arr_ptr, 0, d * d * sizeof(uint32_t));
    timespec_get(&ts, TIME_UTC);
    t0 = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0;
    // column-major: the inner index k jumps d elements at a time
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + k * d + j) += (j + k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    delta = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0 - t0;
    printf("%0.9lf,\t%8u\n", delta, *(arr_ptr + ts.tv_sec % d * d + ts.tv_nsec % d));
    free(arr_ptr);
  }
  return 0;
}

The result is as follows:

Dim ArraySize(KB)   Row-Major Time  RM Sample   Col-Major Time  CM Sample
   10,            0,    0.000000238,          10,   0.000000238,          18
   20,            1,    0.000000238,          19,   0.000000477,          40
   50,            9,    0.000002384,          31,   0.000001431,         122
  100,           39,    0.000013113,          96,   0.000005245,         272
  200,          156,    0.000077486,         263,   0.000042200,         240
  500,          976,    0.000452757,         558,   0.000420809,         362
 1000,         3906,    0.001549959,        1095,   0.001713991,        1030
 2000,        15625,    0.005422354,        3281,   0.006698370,        3571
 5000,        97656,    0.036512375,        5439,   0.053869963,        4949
10000,       390625,    0.148355007,        7462,   1.049520969,        6046
20000,      1562500,    0.768568516,       22097,   7.388022661,       32356

The general theory only starts to hold once the dimension reaches 1000--for dimensions from 50 up through 500, column-major traversal consistently outperforms row-major traversal. This does not look random: I ran the test multiple times with different gcc parameters, such as -O3 -ffast-math and -O3 -fno-tree-vectorize, and the performance gap stays stable. But why?

  • You also use the wrong format specifier for printing: `printf("%5lu,\t%11lu,\t", d, ...`. The correct format specifier for `size_t` is `%zu`, not `%lu`. – Gerhardh Sep 29 '22 at 06:57
  • 'Twould be interesting to see the results if the array(s) weren't being malloc'd out of dynamic memory (where anything can happen), but were defined at compile time. You could define the biggest, but only use bite-size pieces from the top-left corner for smaller trials... Get back to us, okay? (Remember, in some schemes malloc returns a promise, not an entire block of memory...) – Fe2O3 Sep 29 '22 at 06:59
  • Hi @Gerhardh, I believe `+= (j+k)` is well-defined because uint32_t never overflows. Therefore, no matter what is in the uninitialized memory, the result is still defined. Also, I actually switched on `-Wall` and gcc doesn't complain about `%lu`. According to this link, `%lu` should be okay: https://stackoverflow.com/questions/3209909/how-to-printf-unsigned-long-in-c – D.J. Elkind Sep 29 '22 at 07:00
  • Hi @Fe2O3, when you say "defined at compile time", do you mean something like `uint32_t arr[10][10]`, so that the array is on the stack instead of the heap? If so, the issue is that the array has to be very small--my computer segfaults after `arr[50][50]` or so--and that is only a few kilobytes. – D.J. Elkind Sep 29 '22 at 07:03
  • Well, no, `arr[50][50]` would not be the same. You do the row/col part yourself. You would need `arr[20000*20000]`. Of course that won't fit into your stack. You need to make it global. – Gerhardh Sep 29 '22 at 07:04
  • @Gerhardh I think memory stores bytes--after it is `malloc()'ed` to me, it is determined; to the program it just looks like random bytes. But since you asked, I added `memset(arr_ptr, 0, d * d * sizeof(uint32_t));` to the post, and the result is the same. – D.J. Elkind Sep 29 '22 at 07:06
  • It's because of how *2D* arrays are stored in memory, the *CPU* cache, and how data is fetched from *RAM*. Going row-wise, it will fetch (part of, or) an entire row (or maybe even more) once, when fetching the 1st element; but going column-wise, it will fetch for every element. – CristiFati Sep 29 '22 at 07:10
  • I use `+=` exactly because I want to increase randomness. Regarding whether or not reading uninitialized malloc'd memory is undefined: https://stackoverflow.com/questions/66839765/does-reading-an-uninitialized-malloc-memory-invoke-undefined-behaviors Well... I tend to say that at least for unsigned int it should be fine, as each possible bit combination is well-defined... But anyway, I added `memset()`, and it behaves the same. – D.J. Elkind Sep 29 '22 at 07:18
  • Does this answer your question? [Why does the order of the loops affect performance when iterating over a 2D array?](https://stackoverflow.com/questions/9936132/why-does-the-order-of-the-loops-affect-performance-when-iterating-over-a-2d-arra) – CristiFati Sep 29 '22 at 07:19
  • @Gerhardh that is why I added two sample columns. As a result, even the smartest compiler cannot simply throw away my loop. – D.J. Elkind Sep 29 '22 at 07:22
  • Here's an alternative: Run the loop(s) using heap space once, setting the value of every byte/int/whatever. Then, run the loop(s) a 2nd time "timed" (WITHOUT free'ing the block in between!)... My guess is that the significant difference is because of the memory allocation scheme deep in the mysteries of malloc... – Fe2O3 Sep 29 '22 at 07:24
  • @Fe2O3 I thought about what you proposed ("'Twould be interesting to see the results if the array(s) weren't being malloc'd out of dynamic memory"), but it seems it won't help: even if I switch the method and everything works as expected, it still doesn't solve the current puzzle... – D.J. Elkind Sep 29 '22 at 07:28
  • @Fe2O3 hey, your last suggestion works--after adding another loop before the timed one, the 2nd loop behaves the way we generally expect... – D.J. Elkind Sep 29 '22 at 07:31
  • "solving puzzles"... You cannot imagine the 'cleverness' of the people who've implemented the hardware and the software support that is used. Shaving picoseconds off execution time is their jam! Malloc may only 'promise' regions of heap to you, and rely on page-faults to adjust things when/as they are actually used. My latest suggestion is to "actually use" the entire region(s) once, _then_ subject the region(s) to a timed trial... Expected results would point a big finger at "memory management optimisations" that you & I just dream of. – Fe2O3 Sep 29 '22 at 07:33
  • @CristiFati no, this question is about the opposite. In theory, the behavior described in the link you shared should be observed, but here we observe the opposite. The question is about why it is the opposite. – D.J. Elkind Sep 29 '22 at 07:39
  • After raising my coffee level above MIN and looking at the numbers again: the first loop is faster, just as you expected, and with the code in your own answer it is still the same way, just with somewhat different numbers. – Gerhardh Sep 29 '22 at 07:41
  • @Gerhardh sorry, I don't get your point... – D.J. Elkind Sep 29 '22 at 07:48
  • @Gerhardh More coffee... Above, it seems RowMajor is slower for smaller sizes but then overtakes ColMajor for larger sizes... Below (in the answer), ColMajor lags on the 2nd trial and only gets worse (as expected)... – Fe2O3 Sep 29 '22 at 07:48
  • RowMajor is faster for most of the small sizes and all large sizes. Only for 10, 200, 500 is it the other way around. I would assume that for those sizes the cache architecture or the paging mechanism adds some strange penalty. – Gerhardh Sep 29 '22 at 07:54
  • One could, if one had the time, play with the array dimensions and 'divine' the inner machinations of the allocation scheme; how much does it give, and where is the threshold (page fault) that causes it to give more... Who's got that kinda time? `:-)` – Fe2O3 Sep 29 '22 at 08:00
  • @Gerhardh and @Fe2O3, the result is actually a bit more subtle than this--if the dimension grows slowly (say 10, 15, 20, 25, 30, 35, 40 instead of 10, 20, 50, 100, 200, 500, 1000), counterintuitive results like those in the question largely (but not completely) disappear. My thought is that if the dimension grows from 10 to 11, and if `malloc()` reserves a bit of buffer, it may re-use previously `free()'ed` memory, and thus the penalty disappears (see the sketch after this comment thread). – D.J. Elkind Sep 29 '22 at 08:03
  • Yes, reusing the buffer would clearly contribute to such an effect. Maybe not freeing the memory would help, assuming sufficiently large heap. ;) – Gerhardh Sep 29 '22 at 08:09
  • One way to solve these questions about "Major v. Minor" access efficiency AND make the, imo, "page-fault" slowdown disappear: I'll post your code modified as an answer for you to time-trial if you'd like. Otherwise, just ignore it, and I'll delete it in 24hrs.... – Fe2O3 Sep 29 '22 at 08:31
  • Hi @Fe2O3, sure, please do. This will definitely be helpful! – D.J. Elkind Sep 29 '22 at 08:53
  • Hi again. I've removed my "non-answer" that was just code. Cache v. page-faults v. nano-munchkins on amphetamines... It's interesting but one may need far more wizardry to isolate and disable various factors... Another day, another challenge `:-)` – Fe2O3 Sep 29 '22 at 20:34
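
To illustrate the reuse hypothesis from the comment thread above: here is a minimal sketch (an addition for illustration, not code from the thread). It allocates a block, frees it, then allocates another block of the same size and prints both addresses. Whether the second call reuses the first block is allocator-specific; with glibc, blocks below the mmap threshold (128 KB by default) usually come back at the same address with their pages still mapped, while larger blocks are typically served by mmap() and handed back to the OS on free(), so they must be page-faulted in again even if the address repeats.

// Hypothetical sketch (not from the thread): check whether the allocator
// reuses a just-freed block. Identical addresses only *suggest* reuse.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  size_t bytes = 100 * 100 * sizeof(unsigned);  // ~40 KB, below glibc's default mmap threshold

  void *a = malloc(bytes);
  printf("first  block at %p\n", a);
  free(a);

  void *b = malloc(bytes);            // same size as the block just freed
  printf("second block at %p\n", b);  // with glibc, often the same address
  free(b);
  return 0;
}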

1 Answer


After investigating a bit, and per the suggestion from @Fe2O3, the code was revised to the following, and it works:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main() {
  size_t dim[] = {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000};
  struct timespec ts;
  uint32_t* arr_ptr;
  double delta, t0;
  printf(
    "Dim\tArraySize(KB)\tRow-Major Time\tRM Sample\tCol-Major Time\tCM Sample\n"
  );
  for (size_t i = 0; i < sizeof(dim) / sizeof(dim[0]); ++i) {
    size_t d = dim[i];
    printf("%5zu,\t%11zu,\t", d, d * d * sizeof(uint32_t) / 1024);
    arr_ptr = (uint32_t*)malloc(d * d * sizeof(uint32_t));
    // preparatory pass: touch every element once so the pages are mapped in
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + j * d + k) = (j * k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    t0 = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0;
    // timed pass, row-major: the inner index k walks contiguous memory
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + j * d + k) += (j + k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    delta = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0 - t0;
    printf("%0.9lf,\t%8u,\t", delta, *(arr_ptr + ts.tv_sec % d * d + ts.tv_nsec % d));
    free(arr_ptr);

    arr_ptr = (uint32_t*)malloc(d * d * sizeof(uint32_t));
    // preparatory pass for the second array
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + k * d + j) = (j * k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    t0 = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0;
    // timed pass, column-major: the inner index k jumps d elements at a time
    for (size_t j = 0; j < d; ++j) {
      for (size_t k = 0; k < d; ++k) {
        *(arr_ptr + k * d + j) += (j + k);
      }
    }
    timespec_get(&ts, TIME_UTC);
    delta = ts.tv_sec + ts.tv_nsec / 1000.0 / 1000.0 / 1000.0 - t0;
    printf("%0.9lf,\t%8u\n", delta, *(arr_ptr + ts.tv_sec % d * d + ts.tv_nsec % d));
    free(arr_ptr);
  }
  return 0;
}

Results:

Dim ArraySize(KB)   Row-Major Time  RM Sample   Col-Major Time  CM Sample
   10,            0,    0.000000238,          39,   0.000000238,           3
   20,            1,    0.000000238,          83,   0.000000477,          27
   50,            9,    0.000000715,         147,   0.000001431,          15
  100,           39,    0.000002623,        5129,   0.000005245,         485
  200,          156,    0.000009060,       15553,   0.000018835,       24023
  500,          976,    0.000072718,        9557,   0.000153065,        7235
 1000,         3906,    0.000302553,       91963,   0.001093388,      282539
 2000,        15625,    0.001767159,     1004401,   0.006136179,      133513
 5000,        97656,    0.010307550,     2686865,   0.045175791,     2730377
10000,       390625,    0.040507793,    44626185,   0.827179193,    53503515
20000,      1562500,    0.206705093,    26209730,   5.742965460,    173268143

The trick is that we add one more "preparatory loop" before the timed loop. One possible reason: malloc() doesn't really provide all the requested memory up front. The physical memory is actually provided immediately before first access--as a result, the first loop over a fresh allocation can be slower because the system is still busy handing over the memory blocks that malloc() promised.

After the allocated memory has been used for the first time, everything is settled, so the 2nd loop behaves exactly as we expect.
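
To see the demand paging directly (a Linux-specific check added for illustration, not part of the original answer), you can read the process's minor page fault counter via getrusage() before and after each pass; essentially all of the faults should land in the first, untimed pass:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>  // getrusage()

static long minor_faults(void) {
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
  return ru.ru_minflt;  // minor faults: pages mapped in without disk I/O
}

int main(void) {
  size_t d = 10000;
  uint32_t* arr = malloc(d * d * sizeof(uint32_t));

  long before = minor_faults();
  for (size_t i = 0; i < d * d; ++i) arr[i] = 1;   // first touch maps the pages
  long middle = minor_faults();
  for (size_t i = 0; i < d * d; ++i) arr[i] += 1;  // pages already mapped
  long after = minor_faults();

  printf("1st pass faults: %ld\n", middle - before);  // roughly d*d*4 / page size
  printf("2nd pass faults: %ld\n", after - middle);   // close to zero
  printf("sample: %u\n", arr[d]);  // keep the loops observable
  free(arr);
  return 0;
}

On a system with 4 KB pages, the first pass over the 10000x10000 array should show on the order of 100,000 minor faults, while the second pass should show almost none--matching the timing gap between the first and second loops.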

A more detailed version of this answer can be found here
