
I'm analysing and measuring, and I'm getting different results from my analysis and from the measurement. The code is two loops, and the data cache has a size of 512 bytes with a block size of 32 bytes:

int SumByColRow (int matrix[M][M], int size)
{
  int i, j, Sum = 0;

  for (j = 0; j < size; j ++) {
    for (i = 0; i < size; i ++) {
      Sum += matrix[i][j];
    }
  }
  return Sum;
}

int SumByRowCol (int matrix[M][M], int size)
{
  int i, j, Sum = 0;

  for (i = 0; i < size; i ++) {
    for (j = 0; j < size; j ++) {
      Sum += matrix[i][j];
    }
  }
  return Sum;
}

I think it should be faster not to switch rows in the inner loop, since C stores matrices row by row, and therefore SumByRowCol should be faster; but in my measurements it is the other way around. I thought the cache, due to the principle of spatial locality, would make the inner loop faster, since the values are read from consecutive elements. What is the reason that the measured execution times show SumByColRow actually being faster?

SumByColRow: Result: 31744
6415.29 us(641529 ticks)
SumByRowCol: Result: 31744
7336.47 us(733647 ticks)
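
To spell out my expectation: with 32-byte blocks each cache line holds 8 ints, so the row-by-row walk should miss roughly once per 8 accesses, while the column-by-column walk (consecutive accesses a full row, 4 × M bytes, apart) should miss on nearly every access once the matrix outgrows the 512-byte cache. Here is a back-of-envelope sketch of that arithmetic (the dimension n = 64 is just an assumed example, not my actual configuration):

#include <stdio.h>

int main(void)
{
    const int block = 32;                       /* bytes per cache line */
    const int ipb = block / (int)sizeof(int);   /* ints per line: 8 */
    const int n = 64;                           /* assumed matrix dimension */

    /* Row-by-row walk: one miss loads a line, the next ipb-1 accesses hit. */
    int row_misses = (n * n) / ipb;

    /* Column-by-column walk: successive accesses are 4*n bytes apart (one
       full row), so with a matrix much larger than the 512-byte cache each
       line is evicted before it is reused. */
    int col_misses = n * n;

    printf("row-wise misses: %d, column-wise misses: %d\n",
           row_misses, col_misses);
    return 0;
}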

Update

I ran the program again, making sure that I'm actually using the data cache, and this time the result was as expected, so the above result might be a coincidence and the following is more like it:

SumByColRow: Result: 31744
5961.13 us(596113 ticks)
SumByRowCol: Result: 31744
2328.89 us(232889 ticks)
Niklas Rosencrantz
  • Bust out the -S assembly and look at the *real* code generated in release mode. It shouldn't be overly thick for these two samples. – WhozCraig Oct 08 '13 at 08:36
  • Would like an answer to this question as well. I've always been told the loop with the most cycles should be the topmost loop, but that goes completely against my reasoning. (Mine being that if the innermost loop is the largest, there is the least amount of value assigning.) – Lord Zsolt Oct 08 '13 at 08:36
  • Are those numbers from a single run? Or is it an average of multiple runs? Also, is a difference of 0.9 milliseconds really that important? – Some programmer dude Oct 08 '13 at 08:36
  • Why would you use two for loops to sum an array of arbitrary dimensions? Just treat it as a one-dimensional array or pointer, and sum it in a single loop. Probably the compiler will vectorize it using your architecture's SIMD instructions (if applicable). – Tamás Zahola Oct 08 '13 at 08:44
  • The measured difference is not large but the difference is consistent between runs. @LordZsolt Your reasoning could be consistent with the measurements but I don't understand it. – Niklas Rosencrantz Oct 08 '13 at 08:44
  • Good question. Please post your compiler and optimization settings, and if you care to, the disassembly of both functions and the relevant values of `M` and `size`. – WhozCraig Oct 08 '13 at 08:58
  • Do you run both functions one after the other, or in separate runs? If you run them one after the other, does changing the order change the results? – Pranaysharma Oct 08 '13 at 09:14
  • A smart compiler could interchange the loops which would make both functions compile to the exact same code. Not all do that though. Maybe the timing difference is an artifact of your testing and initialization rather than the code itself. Have you tested in another order? How does initialization affect the cache? How big is M? Is it possible that everything will be a cache miss and by going by column first you catch a few cache lines left over from initialization? Could the compiler have generated bigger code for one of the functions making the code for one of them cause more cache misses? – Art Oct 08 '13 at 09:21
  • I ran the program again and this time I get the expected results. I might be able to reproduce my first unexpected result if I use a different configuration for my FPGA. But I'm getting the expected results now and I updated the question with it. Thanks for the comments. – Niklas Rosencrantz Oct 08 '13 at 23:21
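
Illustrating Tamás Zahola's suggestion above with a sketch (hypothetical code, not from the thread; note that it only visits the same elements as the two functions when size == M):

int SumFlat (int matrix[M][M], int size)
{
    /* Treat the M×M matrix as one contiguous run of ints and sum it in a
       single loop; the compiler may then vectorize it with SIMD. */
    const int *p = &matrix[0][0];
    int Sum = 0;

    for (int k = 0; k < size * size; k++)   /* matches the originals only when size == M */
        Sum += p[k];
    return Sum;
}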

1 Answer


I can offer a counter-example, closely based on your code.

Code

#include "timer.h"
#include <stdio.h>

enum { M = 128 };

extern int SumByColRow (int matrix[M][M], int size);
extern int SumByRowCol (int matrix[M][M], int size);

int SumByColRow (int matrix[M][M], int size)
{
    int Sum = 0;

    for (int j = 0; j < size; j ++)
    {
        for (int i = 0; i < size; i ++)
            Sum += matrix[i][j];
    }
    return Sum;
}

int SumByRowCol (int matrix[M][M], int size)
{
    int Sum = 0;

    for (int i = 0; i < size; i ++)
    {
        for (int j = 0; j < size; j ++)
            Sum += matrix[i][j];
    }
    return Sum;
}

static inline int max(int i, int j) { return (i > j) ? i : j; }

int main(void)
{
    int matrix[M][M];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            matrix[i][j] = 1000*i + j;

    Clock clk;
    unsigned long long x[M];
    char buffer[32];
    unsigned long long sum;

    clk_init(&clk);

    clk_start(&clk);
    for (int i = 0; i < M; i++)
        x[i] = SumByColRow(matrix, max(M - i, 10));
    clk_stop(&clk);
    sum = 0;
    for (int i = 0; i < M; i++)
        sum += x[i];
    printf("SumByColRow: value = %llu, time = %s\n", sum, clk_elapsed_us(&clk, buffer, sizeof(buffer)));

    clk_start(&clk);
    for (int i = 0; i < M; i++)
        x[i] = SumByRowCol(matrix, max(M - i, 10));
    clk_stop(&clk);
    sum = 0;
    for (int i = 0; i < M; i++)
        sum += x[i];
    printf("SumByRowCol: value = %llu, time = %s\n", sum, clk_elapsed_us(&clk, buffer, sizeof(buffer)));

    return 0;
}

The two SumBy functions are substantially unchanged (minor notational tweaks, but nothing more). The timing harness stores a start time and a stop time in the Clock structure, and the clk_elapsed_us() function formats the elapsed time in microseconds into the string it is passed.

The messing around with x[i] and so on is to (try and) ensure that the compiler doesn't optimize everything away.
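
Since timer.h itself is not shown, here is a minimal stand-in with the same interface (a sketch assuming gettimeofday(); the real timer.[ch] is considerably more elaborate):

#include <stdio.h>
#include <sys/time.h>

typedef struct Clock
{
    struct timeval start;   /* recorded by clk_start() */
    struct timeval stop;    /* recorded by clk_stop()  */
} Clock;

static void clk_init(Clock *clk)
{
    clk->start.tv_sec = clk->start.tv_usec = 0;
    clk->stop = clk->start;
}

static void clk_start(Clock *clk) { gettimeofday(&clk->start, 0); }
static void clk_stop(Clock *clk)  { gettimeofday(&clk->stop, 0); }

/* Format the elapsed time as seconds.microseconds into the caller's buffer. */
static char *clk_elapsed_us(Clock *clk, char *buffer, size_t buflen)
{
    long sec  = clk->stop.tv_sec  - clk->start.tv_sec;
    long usec = clk->stop.tv_usec - clk->start.tv_usec;
    if (usec < 0)
    {
        usec += 1000000;
        sec--;
    }
    snprintf(buffer, buflen, "%ld.%06ld", sec, usec);
    return buffer;
}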

Output

Machine: Mac OS X 10.8.5, GCC (i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)), Intel Core 2 Duo at 2 GHz, 4 GB 1067 MHz DDR3 RAM (an 'Early 2009' Mac Mini).

SumByColRow: value = 33764046316, time = 0.002411
SumByRowCol: value = 33764046316, time = 0.000677

This shows the expected result — that the column-by-row computation is slower because the matrix (128 × 128 ints × 4 bytes = 64 KiB) is big enough to span many cache lines and pages. It is not yet clear from the question what size M is, nor what size is passed to the SumBy functions, but with a 'big enough' array and varying sizes, you can get the expected performance pattern.

Those times aren't big enough for comfort — I'd rather the lower time was of the order of a second or two. Adding a for (int j = 0; j < 1600; j++) loop in front of each of the timed loops in the main program yields:

SumByColRow: value = 33764046316, time = 2.529205
SumByRowCol: value = 33764046316, time = 1.022970

The ratio is smaller (down from 3.56 to 2.47), but still decidedly tilted in favour of SumByRowCol().
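
For concreteness, the repetition wrapper around the first timed loop looks like this (a sketch; the SumByRowCol loop gets the same treatment):

    clk_start(&clk);
    for (int j = 0; j < 1600; j++)          /* repeat to lengthen the timed region */
    {
        for (int i = 0; i < M; i++)
            x[i] = SumByColRow(matrix, max(M - i, 10));
    }
    clk_stop(&clk);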

Initializing the matrix 'warms the cache' to the extent it can be warmed. Reversing the order of computation (SumByRowCol before SumByColRow) does not make a significant difference to the timings. The results are pretty consistent across multiple runs.

Assembler output

Compiled with gcc -O3 -std=c99 -S. The two functions compile to essentially the same loop structure; the only material difference is the inner-loop stride, addq $512 (one 128-int row) in SumByColRow versus addq $4 (one int) in SumByRowCol:

    .section        __TEXT,__text,regular,pure_instructions
    .globl  _SumByColRow
    .align  4, 0x90
_SumByColRow:
Leh_func_begin1:
    pushq   %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    testl   %esi, %esi
    jg      LBB1_5
    xorl    %eax, %eax
LBB1_2:
    popq    %rbp
    ret
LBB1_5:
    movl    %esi, %ecx
    xorl    %eax, %eax
    movq    %rcx, %rdx
    jmp     LBB1_6
    .align  4, 0x90
LBB1_3:
    addl    (%r8), %eax
    addq    $512, %r8
    decq    %rsi
    jne     LBB1_3
    addq    $4, %rdi
    decq    %rdx
    je      LBB1_2
LBB1_6:
    movq    %rcx, %rsi
    movq    %rdi, %r8
    jmp     LBB1_3
Leh_func_end1:

    .globl  _SumByRowCol
    .align  4, 0x90
_SumByRowCol:
Leh_func_begin2:
    pushq   %rbp
Ltmp2:
    movq    %rsp, %rbp
Ltmp3:
    testl   %esi, %esi
    jg      LBB2_5
    xorl    %eax, %eax
LBB2_2:
    popq    %rbp
    ret
LBB2_5:
    movl    %esi, %ecx
    xorl    %eax, %eax
    movq    %rcx, %rdx
    jmp     LBB2_6
    .align  4, 0x90
LBB2_3:
    addl    (%r8), %eax
    addq    $4, %r8
    decq    %rsi
    jne     LBB2_3
    addq    $512, %rdi
    decq    %rdx
    je      LBB2_2
LBB2_6:
    movq    %rcx, %rsi
    movq    %rdi, %r8
    jmp     LBB2_3
Leh_func_end2:
Jonathan Leffler
  • Good counter-example, +1. Though I would still wonder what causes the OP to obtain the reverse results. – fkl Oct 08 '13 at 09:39
  • @fayyazkl: I'd like to see all the OP's code, so we can run it for ourselves. I'm perfectly willing to make `timer.[ch]` (440 lines total; a bit big for adding to the answer, but a lot of that is comment) available for anyone who needs it (see my profile for my email address) so that my results can be independently verified. Showing the timing framework would allow us to dismiss concerns about 'misattribution' — the right name associated with the wrong result. And more detail about the platform (chips and o/s and compiler) would be interesting. – Jonathan Leffler Oct 08 '13 at 09:48
  • I'm running it on an Altera DE2 FPGA configured with this cache memory. I'm going to double-check that I did all the measurements correctly, and I can post the code I used to time the functions, but the timing will probably not work in environments other than the Altera Nios II IDE. The results posted in the answer are similar to what I would expect, but contrary to my measurements on my FPGA, so I'm going to check that I configured everything correctly when I ran it on the board. I also get similar results in the simulation as when the code is downloaded to the board. – Niklas Rosencrantz Oct 08 '13 at 09:56
  • @909Niklas: hmmm...that is going to make things harder. We can't easily simulate that. What you're seeing is certainly not what I'd expect; what I got is approximately what I'd expect (skewed results — but I would not have wanted to predict the amount of skew). In due course, I recommend adding the platform information to the question — don't leave it buried only in your comment here, please. – Jonathan Leffler Oct 08 '13 at 10:00
  • @JonathanLeffler I was wondering if there is any possibility at all that the Altera FPGA environment's implementation stores matrices in a different order (row-major vs column-major), and perhaps that would have an impact on the results? Just thinking out loud. – fkl Oct 08 '13 at 10:20
  • Usually compilers for specific targets implement subsets of standards. Since this is an implementation detail, they might as well have deviated from a routine implementation. @909Niklas, what toolchain were you using? A standard one, like maybe the GNU tools, or some custom compiler from a vendor? – fkl Oct 08 '13 at 10:22
  • It would impact the results; but it would not be a valid C compiler. The order of the columns and rows is defined by the standard. It is very unlikely to be the issue. But since it is not a mainstream desktop or server computer (or, at least, I don't recognize it as such), funnier things have been known. – Jonathan Leffler Oct 08 '13 at 10:22
  • Oh, I didn't know that was part of the standard. I always thought it was an implementation detail with a generally accepted approach. Could you please quote any reference? I would like to look that up if possible. – fkl Oct 08 '13 at 10:24
  • In C99, §6.5.2.1 Array subscripting: _Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n ≥ 2) with dimensions i×j×…×k, then E … is converted to a pointer to an (n-1)-dimensional array with dimensions j×…×k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the pointed-to (n-1)-dimensional array, …. It follows from this that arrays are stored in row-major order (last subscript varies fastest)._ – Jonathan Leffler Oct 08 '13 at 10:31
  • Thanks - that was helpful. Also found "Row-major order is used in C/C++, PL/I, Python, Speakeasy and others. Column-major order is used in Fortran, MATLAB, GNU Octave, R, Julia, Rasdaman, and Scilab." http://en.wikipedia.org/wiki/Row-major_order – fkl Oct 08 '13 at 11:28
  • I ran the program again and now I get the expected results. My unexpected results might have been caused by a random misconfiguration. When I make sure that I actually use a data cache, I get magnitudes of difference between the two loops in favor of `SumByRowCol`, so the reasoning could be correct that it is faster not to switch rows in the innermost loop. – Niklas Rosencrantz Oct 08 '13 at 23:24
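
As a small demonstration of the row-major storage discussed in the comments above (a hypothetical sketch, not part of the thread), printing the addresses of a 2 × 3 array shows the last subscript varying fastest:

#include <stdio.h>

int main(void)
{
    int a[2][3];

    /* In row-major order a[0][0], a[0][1], a[0][2], a[1][0], ... occupy
       consecutive addresses: the last subscript varies fastest. */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("&a[%d][%d] = %p\n", i, j, (void *)&a[i][j]);
    return 0;
}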