-2

I have a simple function that multiplies two matrices.

void mmul1(float A[ni][nk], float B[nk][nj], float C[ni][nj])
{
    int i, j, k;
    for (i=0; i<ni; i++) {
        for (j=0; j<nj; j++) {
            C[i][j] = 0;
            for (k=0; k<nk; k++) {
                C[i][j] += A[i][k]*B[k][j];
            }
        }
    }
}

I have a main function that looks like this:

int main(int argc, char** argv) {

    // timer structs
    struct  timeval ts, te, td;
    float tser, tpar, diff;
    int i, j, k;

    printf("matrix size : %d x %d x %d\n", ni, nj, nk);

    srand(0);

    // initialization
    for (i=0; i<ni; i++) {
        for (k=0; k<nk; k++) {
            A[i][k] = (float)rand()/RAND_MAX;
        }
    }
    for (k=0; k<nk; k++) {
        for (j=0; j<nj; j++) {
            B[k][j] = (float)rand()/RAND_MAX;
        }
    }

    gettimeofday(&ts, NULL);
    for (i=0; i<ni; i++) {
        for (j=0; j<nj; j++) {
            Cans[i][j] = 0;
            for (k=0; k<nk; k++) {
                Cans[i][j] += A[i][k]*B[k][j];
            }
        }
    }
    gettimeofday(&te, NULL);
    timersub(&ts, &te, &td);
    tser = fabs(td.tv_sec+(float)td.tv_usec/1000000.0);

    gettimeofday(&ts, NULL);
    mmul1(A, B, C);
    gettimeofday(&te, NULL);
    timersub(&ts, &te, &td);
    tpar = fabs(td.tv_sec+(float)td.tv_usec/1000000.0);

    // compare results
    diff = compute_diff(C, Cans);

    printf("Performance : %.2f GFlop/s (%.1fX)\n", 2.0*ni*nj*nk/tpar/1000000000, tser/tpar );
    printf("Result Diff : %.3f\n", diff );

    return 0;
}

I am compiling with gcc's -O3 flag.

When testing, I found that if I add static inline to mult's signature, I get a 5X speedup when testing on 512x512 matrices. The overhead of a function call should be negligible compared to the multiplication. Why is this performance penalty occurring (is the compiler generating different machine code?), and how can I fix it without inlineing mult?

tomKPZ
  • 827
  • 1
  • 10
  • 17
  • Are the matrices hard-coded or are they derived from user input? If they're hard-coded, do you get the same behavior when they're derived from user input? – user541686 Jan 19 '15 at 07:53
  • The matrices are filled with rand(). – tomKPZ Jan 19 '15 at 07:54
  • @MohitJain : That information is in the question. – Clifford Jan 19 '15 at 07:56
  • Would it help if I post the objdump of the relevant code? – tomKPZ Jan 19 '15 at 07:57
  • 1
    Can you post the assembler output (gcc -S)? – liorda Jan 19 '15 at 07:58
  • Are the results used after the call to `mult()`? You have unhelpfully elided the code, and omitted the definition of `mult()`. It is possible that the optimizer removes the call altogether if the results are unused and `mult()` has no side effects. – Clifford Jan 19 '15 at 07:58
  • Now did you really do **only one** single multiplication? That's not going to be representative. You should do more (e. g. a million) of them, and then inspect the results. – The Paramagnetic Croissant Jan 19 '15 at 07:59
  • @user1887231 : If you could do that, it would probably answer your question, and you wouldn't need to ask. Why wait to be asked - just do it if you think it will be helpful. – Clifford Jan 19 '15 at 08:00
  • Please post come realistic code. – juanchopanza Jan 19 '15 at 08:03
  • possible duplicate of [Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?](http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of) – phuclv Jan 19 '15 at 08:03
  • Similar: [Why is my program slow when looping over exactly 8192 elements?](http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements?lq=1), [Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?](http://stackoverflow.com/questions/6060985/why-is-there-huge-performance-hit-in-2048x2048-versus-2047x2047-array-multiplica?lq=1) – phuclv Jan 19 '15 at 08:04
  • You have edited the question so that it no longer makes sense - there is now no function `mult()` (called or defined). – Clifford Jan 19 '15 at 19:44
  • After calling `mmul1(A, B, C);`, neither `A`, `B` or `C` are referenced, so the optimiser can remove the code in-lined altogether. So you have your answer - it was guessed before you posted the code. Declare `A`, `B` and `C` `volatile` and see is the apparent "speed up" goes away. – Clifford Jan 19 '15 at 19:49

1 Answers1

1

Since you don't use the results in main, when you inline the function the optimizer can see there are no side-effects that are being used, and is free to remove all of the matrix multiplication code.

You can use gcc's -S flag to look at the generated assembly code.

juanchopanza
  • 223,364
  • 34
  • 402
  • 480
  • While that is indeed a possibility - one I mentioned in the comments, but the code fragment is heavily elided ( lots of `...` ), so how can we know the result is unused? – Clifford Jan 19 '15 at 08:02
  • 2
    @Clifford We'll have to wait and see, if and when OP fixes the question. – juanchopanza Jan 19 '15 at 08:05
  • 1
    I agree. The obvious concern is pointer aliasing. In my experience you should find out if you can force the compiler to assume no pointer aliasing (e.g. use `restrict` if supported) in the relevant function and add a debug check (use the `assert(.)` macro) to enforce it. It's a common issue with numerical analysis code because it's often massing up large similar structures (i.e. big n-dim arrays of `float` or `double`). If you can't enforce it you might need two versions of the function and use the aliasing check as a switch. – Persixty Jan 19 '15 at 08:21