Out of interest, I decided to compare the performance of (inexpertly) handwritten C vs. Python/numpy performing a simple matrix multiplication of two large, square matrices filled with random numbers from 0 to 1.

I found that Python/numpy outperformed my C code by over 10,000x. This is clearly not right, so what is wrong with my C code that causes it to perform so poorly (even when compiled with -O3 or -Ofast)?

The python:

import time
import numpy as np

t0 = time.time()
m1 = np.random.rand(2000, 2000)
m2 = np.random.rand(2000, 2000)
t1 = time.time()
m3 = m1 @ m2
t2 = time.time()
print('creation time: ', t1 - t0, ' \n multiplication time: ', t2 - t1)

The C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {

    clock_t t0=clock(), t1, t2;

    // create matrices and allocate memory  
    int m_size = 2000;
    int i, j, k;
    double running_sum;
    double *m1[m_size], *m2[m_size], *m3[m_size];
    double f_rand_max = (double)RAND_MAX;
    for(i = 0; i < m_size; i++) {
        m1[i] = (double *)malloc(sizeof(double)*m_size);
        m2[i] = (double *)malloc(sizeof(double)*m_size);
        m3[i] = (double *)malloc(sizeof(double)*m_size);
    }
    // populate with random numbers 0 - 1
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++) {
            m1[i][j] = (double)rand() / f_rand_max;
            m2[i][j] = (double)rand() / f_rand_max;
        }
    t1 = clock();

    // multiply together
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++) {
            running_sum = 0;
            for (k = 0; k < m_size; k++)
                running_sum += m1[i][k] * m2[k][j];
            m3[i][j] = running_sum;
        }

    t2 = clock();

    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
    printf("creation time: %f", t01 );
    printf("\nmultiplication time: %f", t12 );
    return 0;
}

EDIT: I have corrected the Python to do a proper dot product, which closes the gap a little, and the C to time with microsecond resolution and to use the comparable double data type, rather than the float originally posted.
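For reference, clock() measures CPU time, whereas Python's time.time() is a wall clock. A wall-clock timer in C could be built on the POSIX clock_gettime call; the helper below is only a sketch and is not what the timings below use.

#include <time.h>

// wall-clock seconds via the POSIX monotonic clock; the benchmarks
// in this question still use clock(), i.e. CPU time
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}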

Outputs:

$ gcc -O3 -march=native bench.c
$ ./a.out
creation time: 0.092651
multiplication time: 139.945068
$ python3 bench.py
creation time: 0.1473407745361328
multiplication time: 0.329038143157959

It has been pointed out that the naive algorithm implemented here in C could be improved in ways that lend themselves to better use of compiler optimisations and the cache.
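For context, one common way to fix the access pattern without an explicit transpose is simply to reorder the loops. The sketch below is illustrative only and is not the approach taken in the edit that follows; it reuses the variable names from the code above and assumes m3 has been zero-initialised first (e.g. with memset or calloc).

// illustrative i-k-j loop order: the inner loop now reads m2[k][j] and
// writes m3[i][j] sequentially, so the stride-m_size accesses disappear
for (i = 0; i < m_size; i++)
    for (k = 0; k < m_size; k++) {
        double a_ik = m1[i][k];
        for (j = 0; j < m_size; j++)
            m3[i][j] += a_ik * m2[k][j];
    }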

EDIT: After modifying the C code to transpose the second matrix in order to achieve a more efficient access pattern, the gap closes further.

The modified multiplication code:

// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i=0; i < m_size; i++)
    for (j=0; j < m_size; j++)
        m3[j][i] = m2[i][j];
// swap the row pointers so that m2 refers to the transposed data
for (i = 0; i < m_size; i++) {
    double *mtemp = m3[i];
    m3[i] = m2[i];
    m2[i] = mtemp;
}


// multiply together
for (i=0; i < m_size; i++)
    for (j=0; j < m_size; j++) {
        running_sum = 0;
        for (k = 0; k < m_size; k++)
            running_sum += m1[i][k] * m2[j][k];
        m3[i][j] = running_sum;
    }

The results:

$ gcc -O3 -march=native bench2.c
$ ./a.out
creation time: 0.107767
multiplication time: 10.843431
$ python3 bench.py
creation time: 0.1488208770751953
multiplication time: 0.3335080146789551

EDIT: Compiling with -Ofast, which I am reassured is a fair comparison, brings the difference down to just over an order of magnitude (in numpy's favour).

$ gcc -Ofast -march=native bench2.c
$ ./a.out
creation time: 0.098201
multiplication time: 4.766985
$ python3 bench.py
creation time:  0.13812589645385742
multiplication time:  0.3441300392150879
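As the comments explain, the main thing -Ofast buys here is permission to reassociate the floating-point reduction, which lets the compiler keep several partial sums in flight and vectorise the inner loop. Written out by hand, the idea looks roughly like the sketch below; it assumes m_size is a multiple of 4 and the transposed m2 from the previous edit, and it was not benchmarked separately.

// multiple accumulators hide the latency of the dependent FP adds;
// -Ofast allows the compiler to perform this transformation itself
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (k = 0; k < m_size; k += 4) {
            s0 += m1[i][k]     * m2[j][k];
            s1 += m1[i][k + 1] * m2[j][k + 1];
            s2 += m1[i][k + 2] * m2[j][k + 2];
            s3 += m1[i][k + 3] * m2[j][k + 3];
        }
        m3[i][j] = (s0 + s1) + (s2 + s3);
    }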

EDIT: It was suggested to change the indexing from arr[i][j] to arr[i * m_size + j]; this yielded a small performance increase:

for m_size = 10000

$ gcc -Ofast -march=native bench3.c # indexed by arr[ i * m_size + j ]
$ ./a.out
creation time: 1.280863
multiplication time: 626.327820
$ gcc -Ofast -march=native bench2.c # indexed by arr[i][j]
$ ./a.out
creation time: 2.410230
multiplication time: 708.979980
$ python3 bench.py
creation time:  3.8284950256347656
multiplication time:  39.06089973449707

The up-to-date code, bench3.c:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {

    clock_t t0, t1, t2;

    t0 = clock();
    // create matrices and allocate memory
    int m_size = 10000;
    int i, j, k, x, y;
    double running_sum;
    double *m1 = (double *)malloc(sizeof(double)*m_size*m_size),
           *m2 = (double *)malloc(sizeof(double)*m_size*m_size),
           *m3 = (double *)malloc(sizeof(double)*m_size*m_size);
    double f_rand_max = (double)RAND_MAX;

    // populate with random numbers 0 - 1
    for (i=0; i < m_size; i++) {
        x = i * m_size;
        for (j=0; j < m_size; j++) {
            m1[x + j] = ((double)rand()) / f_rand_max;
            m2[x + j] = ((double)rand()) / f_rand_max;
            m3[x + j] = ((double)rand()) / f_rand_max;
        }
    }
    t1 = clock();

    // transpose m2 in order to capitalise on cache efficiencies
    // store transposed matrix in m3 for now
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++)
            m3[j*m_size + i] = m2[i * m_size + j];
    // swap the pointers
    double *mtemp = m3;
    m3 = m2;
    m2 = mtemp;


    // multiply together
    for (i=0; i < m_size; i++) {
        x = i * m_size;
        for (j=0; j < m_size; j++) {
            running_sum = 0;
            y = j * m_size;
            for (k = 0; k < m_size; k++)
                running_sum += m1[x + k] * m2[y + k];
            m3[x + j] = running_sum;
        }
    }

    t2 = clock();

    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
    printf("creation time: %f", t01 );
    printf("\nmultiplication time: %f", t12 );
    return 0;
}
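A further refinement suggested in the comments below is to keep the two-index syntax while still using a single aligned allocation, via a pointer to a variable-length array. A minimal sketch follows, with an illustrative size; it is not part of the benchmarked code.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 2000;                               // illustrative size
    // one contiguous, 64-byte-aligned buffer (C11 aligned_alloc);
    // the pointer-to-VLA type keeps the a[i][j] indexing syntax
    double (*a)[n] = aligned_alloc(64, sizeof(double[n][n]));
    if (!a) return 1;
    a[3][5] = 1.0;                              // row-major, contiguous
    printf("%f\n", a[3][5]);
    free(a);
    return 0;
}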
    In the numpy code, you have `m3 = m1 * m2`. That is *element-wise* multiplication. Use `m3 = m1.dot(m2)`, or, if you are using Python 3, `m3 = m1 @ m2`. (I haven't looked closely at your C code, so I don't know if there are any issues there.) – Warren Weckesser Oct 03 '17 at 19:44
    In addition to the above, it's kind of comparing apples to oranges. How have you compiled your C program? Have you enabled any optimizations? Why do you think numpy is operating with a `float`-equivalent, rather than `double`? – Eugene Sh. Oct 03 '17 at 19:45
  • Maybe a fair comparison would be also a fully implemented code in python. Try [BLAS](http://www.netlib.org/blas/) if you want a top of line performance on C. Numpy as far as i know uses it. – Guto Oct 03 '17 at 19:47
  • `%ld` is not necessarily the correct print specifier for `time_t`. Suggest `printf("creation time: %lld", (long long) t1 );` – chux - Reinstate Monica Oct 03 '17 at 19:51
    By the way....How can you detect x10000 difference with this code and 1 second time resolution? – Eugene Sh. Oct 03 '17 at 19:53
  • What was your output for the 2 programs? – chux - Reinstate Monica Oct 03 '17 at 19:55
    Your C accesses `m2[k][j]` in the inner loop over `k`. Compilers aren't smart enough to optimize your naive matmul by transposing first. (And compilers probably aren't allowed to, because there's nowhere to keep the temporary array.) @EugeneSh.: IDK about the 10k, but with gcc7.1.1 `-Ofast -march=native` output on a Skylake i7-6700k takes 37 seconds, running at 0.10 instructions per cycle. According to `perf stat -d`, L1-dcache-loads throughput was 139 M loads/s. The python multiply time is ~0.012 seconds, with perf counters (for the whole process) showing 1977 M loads / sec, 2.28 IPC. – Peter Cordes Oct 03 '17 at 20:34
    Average CPU clocks for both C and python were about 3.8GHz, so those load per second numbers are directly proportional to loads per clock. The C run time is totally dominated by waiting for cache misses, with vastly higher traffic to last-level cache. And it can't take good advantage of SIMD. Even if python was doing a matmul, it would win by a lot because a naive C matmul is garbage. That's why there are optimized libraries for it. – Peter Cordes Oct 03 '17 at 20:37
    going to look for a duplicate. I'm sure naive C matmul isn't a new question. Timing it against an element-wise numpy matmul isn't the important part of this question. – Peter Cordes Oct 03 '17 at 20:38
  • But the op should still get the *correct* numbers from numpy – Antti Haapala -- Слава Україні Oct 03 '17 at 20:52
  • Related: another 2D matrix loop slowed down by a bad access pattern. https://stackoverflow.com/questions/18262547/i-want-to-optimize-this-short-loop/18263912#18263912 – Peter Cordes Oct 03 '17 at 20:54
  • The take-away here is that if you didn't even know that looping over a column-major matrix was slow, you're not even close to being able to write an efficient matmul in C. It's all about the cache, because there's O(n^3) work to do over n^2 memory, so optimizing for high reuse is critical. See https://stackoverflow.com/a/1305744/224132 for comments about thrashing the cache if you loop over sequential memory in A, B, and C like the code in the "How does BLAS get such extreme performance?" question does. (Instead of doing a whole row*column dot product.) – Peter Cordes Oct 03 '17 at 22:03
  • re: your update. `-Ofast` would be helpful here if you didn't totally bottleneck on memory, because it allows the compiler to hide the FP-add latency. Actually, it might allow the compiler to do 4 elements of `C` in parallel. See https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction for more about latency vs. throughput and why a reduction loop like `running_sum +=` needs multiple accumulators to bottleneck on throughput instead of latency. – Peter Cordes Oct 03 '17 at 22:46
  • I timed with`-Ofast`, and got only 37 seconds. But my desktop has an 8MB L3 cache, and my RAM is fast (DDR4-2666). So probably a lot of the speed difference is in my hardware handling the terrible access pattern better than yours does, but maybe `-Ofast` also helped (if it let the compiler do multiple elements at once). – Peter Cordes Oct 03 '17 at 22:49
  • @PeterCordes would -Ofast be a fair comparison? I understood that -Ofast sacrifices some level of mathematical precision? – ezekiel Oct 04 '17 at 09:42
  • @EugeneSh. Yes, you're right, I falsely assumed that np.float was the same as C, 32 bits but it is in fact 64 bits, editing to make this correction – ezekiel Oct 04 '17 at 09:51
  • @ezekiel: It treats FP math as associative even though it isn't strictly. This allows auto-vectorization by computing `a0+a4+a8` in parallel with `a1+a5+a9`, ... and adding at the end. BLAS libraries are definitely using SIMD, so yes, this is fair. If you link with `-Ofast`, it also adds code before `main()` that sets denormals-are-zero and flush-to-zero, disabling gradual underflow. That has *no* effect if your calculations already didn't underflow. (It also allows optimizations that assume no NaN or infinities, but that's not relevant here). So yes, for *this* case it's a fair comparison. – Peter Cordes Oct 04 '17 at 16:08
  • Oh holy crap, I just realized that you're not using 2D arrays, you're using arrays of pointers to arrays. So there might not be any locality between rows of the same matrix, and in the order you allocate them they're interleaved. That might actually be helpful, but the extra overhead for indexing sucks. – Peter Cordes Oct 04 '17 at 16:11
  • @PeterCordes yes, I tried using straightforward 2D arrays initially but got segmentation errors for very large matrices. – ezekiel Oct 04 '17 at 20:31
    You mean variable-length local arrays like `float arr[n][m]`? Yeah, of course that segfaults if you run out of stack. Allocate one big buffer with malloc (or better, `aligned_alloc` to a 64-byte boundary) and do the 2D indexing yourself with `y*rows + x`. Or see https://stackoverflow.com/a/8653484/224132 for C99/C11 syntax that still lets you index it with `[y][x]` – Peter Cordes Oct 04 '17 at 20:47
  • I can reopen this if you'd like to move some of your updates to an answer. Interesting that it cut the creation time in half to do one big allocation. Each of those `malloc` calls probably involved a TLB invalidation when the kernel updated your process's page table with a new memory mapping. Also interesting that it still made a difference to the multiplication time with such long rows. – Peter Cordes Oct 05 '17 at 07:26
  • I reopened it. If you have the time, please move some of the followup edits to the question into an answer. I might write up some of my comments, too. – Peter Cordes Oct 05 '17 at 07:38
  • @PeterCordes added a conclusion, welcome any further comments that you might have. – ezekiel Oct 05 '17 at 09:55
  • Please post the answer-type part of your question as an **answer** (click answer your own question below) where people can vote on it separately. A question that answers itself isn't a question anymore and isn't a good fit for Stack Overflow. :P – Peter Cordes Oct 05 '17 at 09:58

1 Answer

CONCLUSION: The original, absurd factor-of-10,000 difference was largely the result of mistakenly comparing element-wise multiplication in Python/numpy against C code that was neither compiled with all of the available optimisations nor written with a cache-friendly memory access pattern.

A 'fair' comparison (i.e. a correct but highly inefficient single-threaded algorithm, compiled with -Ofast) gives a performance difference of about x350. A number of simple edits to improve the memory access pattern brought that down to a factor of x16 (in numpy's favour) for large (10000 x 10000) matrix multiplication. Furthermore, numpy automatically uses all four virtual cores on my machine whereas this C does not, so the remaining per-core difference could be a factor of x4 - x8 (depending on how well this program would run with hyperthreading).

I consider a factor of x4 - x8 to be fairly sensible, given that I don't really know what I'm doing and just knocked a bit of code together, whereas numpy is based on BLAS, which I understand has been extensively optimised over the years by experts from all over the place, so I consider the question answered/solved.
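The comments also mention loop-tiling / cache-blocking, which is a large part of how optimised BLAS libraries keep data in cache. A minimal sketch of the idea follows; it is illustrative rather than benchmarked, uses the flat row-major layout of bench3.c without the transpose step, assumes m3 has been zeroed first, and BLOCK would normally be tuned to the cache sizes.

// loop-tiling / cache-blocking: each BLOCK x BLOCK tile of m1 and m2
// is reused many times while it is still hot in cache
#define BLOCK 64
for (int ii = 0; ii < m_size; ii += BLOCK)
    for (int kk = 0; kk < m_size; kk += BLOCK)
        for (int jj = 0; jj < m_size; jj += BLOCK) {
            int i_hi = ii + BLOCK < m_size ? ii + BLOCK : m_size;
            int k_hi = kk + BLOCK < m_size ? kk + BLOCK : m_size;
            int j_hi = jj + BLOCK < m_size ? jj + BLOCK : m_size;
            for (int i = ii; i < i_hi; i++)
                for (int k = kk; k < k_hi; k++) {
                    double a_ik = m1[i * m_size + k];
                    for (int j = jj; j < j_hi; j++)
                        m3[i * m_size + j] += a_ik * m2[k * m_size + j];
                }
        }

On top of blocking, an optimised BLAS also uses SIMD and multiple threads, which this sketch does not attempt.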

  • I never saw anything like x10k. What the heck is that from? If it's from `gcc -O0`, then stop quoting that number, it's useless. [Benchmarking un-optimized code is not interesting.](https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment/32001196#32001196) – Peter Cordes Oct 05 '17 at 10:11
  • If you have a 2c4t CPU (2 cores with hyperthreading), that's probably only 2x to maybe 2.6x faster than a single core. HT is good for some things, but a well-tuned matmul can mostly max out the execution units of a physical core using only one logical thread. Only a factor of 4 seems too small for an optimized BLAS vs. C that [thrashes the caches by looping over whole rows, instead of using loop-tiling / cache-blocking to get more reuse and thus more bandwidth](https://stackoverflow.com/a/1305744/224132). But 10x to 8x might be reasonable, especially if you have fast RAM. – Peter Cordes Oct 05 '17 at 10:19
  • It was with -O3 and several algorithmic mistakes and inefficiencies. A 'fair' comparison (ie. correct, but highly inefficient single-threaded algorithm, compiled with -Ofast) yields a performance factor difference of x350, editing to make this clear. – ezekiel Oct 05 '17 at 10:24