Running C extension in Python faster than plain C

Question

I have implemented a Python extension in C and found that executing a C function inside of Python to be 2x faster than just executing the C code from a C main.

But why is this faster? I would expect the plain C to be exactly the same performance when called from Python as it is when called from C.

Here is my experiment:

Plain C compute code (simple 3for matrix-matrix multiplication)
Plain C main function that calls the mmult() function
Python extension wrapper to call the mmult() function
All timing is happening entirely within the C code

Here are my results:

Pure C - 85us

Python Extension - 36us

Heres my code:

--mmult.cpp----------

#include "mmult.h"

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]) {

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);

  for(int i=0; i<32; i=i+1) {
    for(int j=0; j<32; j=j+1) {
        int32_t result=0;
         for(int k=0; k<32; k=k+1) {
           result+=a[i*32+k]*b[k*32+j];
         }
         c[i*32+j] = result;
      }
  }

  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);

}

--mmult.h-------

#include <stdint.h>

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]);

--main.cpp------

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mmult.h"

int main() {
  int* a = (int*)malloc(sizeof(int)*1024);
  int* b = (int*)malloc(sizeof(int)*1024);
  int* c = (int*)malloc(sizeof(int)*1024);

  for(int i=0; i<1024; i++) {
    a[i]=i+1;
    b[i]=i+1;
    c[i]=0;
  }

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);
  mmult(a,b,c);
  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);
  free(a);
  free(b);
  free(c);

  return 0;
}

Heres how I compile main:

gcc -o main main.cpp mmult.cpp -O3

--wrapper.cpp-----

#include <Python.h>
#include <numpy/arrayobject.h>
#include "mmult.h"

static PyObject* mmult_wrapper(PyObject* self, PyObject* args) {
   int32_t* a;
   PyArrayObject* a_obj = NULL;
   int32_t* b;
   PyArrayObject* b_obj = NULL;
   int32_t* c;
   PyArrayObject* c_obj = NULL;

   int res = PyArg_ParseTuple(args, "OOO", &a_obj, &b_obj, &c_obj);

   if (!res)
      return NULL;

   a = (int32_t*) PyArray_DATA(a_obj);
   b = (int32_t*) PyArray_DATA(b_obj);
   c = (int32_t*) PyArray_DATA(c_obj);

   /* call function */
   mmult(a,b,c);

   Py_RETURN_NONE;
}

/*  define functions in module */
static PyMethodDef TheMethods[] = {
   {"mmult_wrapper", mmult_wrapper, METH_VARARGS, "your c function"},
   {NULL, NULL, 0, NULL}
};

static struct PyModuleDef cModPyDem = {
   PyModuleDef_HEAD_INIT,
   "mmult", "Some documentation",
   -1,
   TheMethods
};

PyMODINIT_FUNC
PyInit_c_module(void) {
   PyObject* retval = PyModule_Create(&cModPyDem);
   import_array();
   return retval;
}

--setup.py-----

import os
import numpy
from distutils.core import setup, Extension
cur = os.path.dirname(os.path.realpath(__file__))
c_module = Extension("c_module", sources=["wrapper.cpp","mmult.cpp"],include_dirs=[cur,numpy.get_include()])
setup(ext_modules=[c_module])

--code.py-----

import c_module
import time
import numpy as np
if __name__ == "__main__":
    a = np.ndarray((32,32),dtype='int32',buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    b = np.ndarray((32,32),dtype='int32',buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    c = np.ndarray((32,32),dtype='int32',buffer=np.zeros((32,32),dtype='int32'))

    c_module.mmult_wrapper(a,b,c)

Heres how I compile the Python extension:

python3.6 setup_sw.py build_ext --inplace

UPDATE

Ive updated the mmult.cpp code to run the 3for for 1,000,000 iterations internally. This resulted in very similar times:

Pure C - 27us

Python Extension - 27us

You're using two different compilers, correct? So I'm guessing that one is better at creating efficient executables in this one instance — bendl, Nov 29 '17 at 13:56
_executing a C function inside of Python to be 2x faster than just executing the C code from a C main_ <--- are you sure, how do you benchmarking the speed here, I'm curious to know what made you to come in conclusion that executing C function in Python is faster? — danglingpointer, Nov 29 '17 at 13:56
In most of these cases, the answer is incorrect benchmarking. — Lundin, Nov 29 '17 at 13:58
`int* a = (int*)malloc(sizeof(int)*1024);` -- This number of entries (to me) is a tiny amount to be used for meaningful benchmark tests. — PaulMcKenzie, Nov 29 '17 at 14:04

Basile Starynkevitch · Answer 1 · 2017-11-29T15:29:52.597

85 microseconds is too small a delay to be measured reliably and repeatedly. For example, CPU cache effects (or context switches, or paging) may dominate the computation time (and alter it to make that timing meaningless).

^{(I guess you are on Linux/x86-64)}

As a rule of thumb, try to have a run lasting about half a second at least, and repeat the benchmarking a few times. You could also use time(1) for measurements.

See also time(7). There are several notions of time (elapsed "real" time, monotonic time, process cpu time, thread cpu time, etc...). You could consider using clock(3) or clock_gettime(2) to measure time.

BTW, you might compile with a more recent version of GCC (in November 2017, GCC7 and in a few weeks GCC8) and you want to compile with gcc -march=native -O3 for benchmarking purposes. Try also other optimization options and tuning. You could also try another compiler, e.g. Clang/LLVM.

Look also at this answer (regarding parallelization) to a relevant question. Probably the numpy package is using (internally) similar techniques (outside of the Python GIL), so could be faster than your naive sequential matrix multiplication code in C.

Running C extension in Python faster than plain C

1 Answers1