
I do not really know how to ask this question, but I will try to be as clear as possible.

I am timing a C++ function call from Python. The C++ function is wrapped with Cython. I am currently timing the Python call of the Cython function and I get about 52.9 ms with time.time(). On the other hand, I am timing the whole C++ function with C++'s std::chrono::high_resolution_clock.

The thing is, I am measuring 17.1 ms in C++.

The C++ function is declared as vector<float> cppfunc(vector<float> array, int a, int b, int c); and is a method of class A.

The Cython code only calls the C++ class method. The vector contains about 320k elements.

I was wondering whether these two measured times can be compared like this. If they can, what explains this gap? If not, which timing tool should I use?

Edit 1: (links in comments) both timing methods are precise enough for my use case (resolution on the order of 1e-9 s for C++ on my arch and 1e-6 s for Python).

Edit 2: added simplified code to illustrate my point. With this code, the Python call duration (~210 ms) is 8x the internal C++ duration (~28 ms).

// example.cpp
#include "example.h"
#include <cstdio>
#include <chrono>

std::vector<float> wrapped_function(std::vector<float> array)
{
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<float> result;
    for (int i = 0; i < (int) array.size(); i++) {
        result.push_back(array[i] / 1.1);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<float> duration = end - start;
    printf("Within duration: %.5f\n", duration.count());
    return result;
}


// example.h
#ifndef EXAMPLE_H_
#define EXAMPLE_H_
#include <vector>
std::vector<float> wrapped_function(std::vector<float> array);
#endif


# example_wrapper.pxd
from libcpp.vector cimport vector
cdef extern from "example.h":
    vector[float] wrapped_function(vector[float])


# example_wrapper.pyx
from example_wrapper cimport wrapped_function
def cython_wrap(array):
    return wrapped_function(array)


# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(
      cmdclass = {"build_ext": build_ext},
      ext_modules = [
            Extension(name="example_wrapper",
                      sources=["example_wrapper.pyx", "example.cpp"],
                      include_dirs=["/home/SO/"],
                      language="c++",
                      extra_compile_args=["-O3", "-Wall", "-std=c++11"]
                      )
            ]
)


# test.py
import example_wrapper
from time import time

array = [i for i in range(1000000)]
t0 = time()
result = example_wrapper.cython_wrap(array)
t1 = time()
print("Wrapped duration: {}".format(t1 - t0))
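On the question of which Python timing tool to use: time.time() reads the wall clock, which the system can adjust and which typically has coarser resolution, while time.perf_counter() is monotonic and is the documented choice for measuring short intervals. A minimal sketch of a more robust measurement loop (the work function here is a pure-Python stand-in, since the compiled example_wrapper module is not assumed to be available):

```python
from time import perf_counter

def work(array):
    # pure-Python stand-in for example_wrapper.cython_wrap;
    # the real call needs the compiled extension module
    return [x / 1.1 for x in array]

array = [i for i in range(1000000)]

# repeat the measurement and keep the best run to reduce noise
durations = []
for _ in range(5):
    t0 = perf_counter()
    result = work(array)
    durations.append(perf_counter() - t0)

print("Wrapped duration (best of 5): {:.4f}".format(min(durations)))
```

Taking the best of several runs filters out one-off interference (scheduling, caches warming up) that a single time.time() measurement would silently include.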
Romzie
  • You should investigate how precise each timing method is. Clocks have a granularity which can be as much as several tens of milliseconds. – Richard Critten May 29 '18 at 15:12
  • Does your C++ timing include the copying of the huge vector parameter? Or are you just timing within the function itself? – Paul R May 29 '18 at 15:14
  • I am just timing within the function itself; now that you mention it, it could well be related to the size of the array. – Romzie May 29 '18 at 15:17
  • According to [this](https://stackoverflow.com/questions/8386128/how-to-get-the-precision-of-high-resolution-clock) and [this](https://stackoverflow.com/questions/1938048/high-precision-clock-in-python) post, both timing methods have sufficient precision for my application. So the only time I have not measured is the copy of the parameter, which is massive. Any hints on a better way to pass a numpy array to a C++ function? – Romzie May 29 '18 at 15:38
  • I first thought this question had to be theoretical and I was looking for general advice, but you're right, so I added example source code. Sorry for the inconvenience. – Romzie May 30 '18 at 09:10

1 Answer


Obviously the difference comes from Cython overhead, but why is it so big?

The call of the wrapped function is more complex than meets the eye:

def cython_wrap(array):
    return wrapped_function(array)

array is a list of integers, but wrapped_function expects a vector of floats, so Cython automatically creates a vector and populates it with values from the list.

wrapped_function returns a vector of floats, but in order to be usable from Python it must be converted to a Python list. Once again, Cython automatically creates a Python list and populates it with Python floats (which are quite costly to construct) corresponding to the floats in the returned vector.

As you can see, a lot of copying is going on and this explains the overhead you are observing.
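To get a feel for the size of this effect, the two hidden conversions can be roughly approximated in pure Python: each direction touches every one of the million elements, and the way out creates a new Python object per element. A sketch of that per-element work (an illustration of the cost, not what Cython literally executes):

```python
from time import perf_counter

n = 1000000
array = [i for i in range(n)]  # same input as in test.py

t0 = perf_counter()
# "in": roughly what Cython does to build a vector[float] from the list
as_floats = [float(x) for x in array]
# "out": roughly what Cython does to turn the returned vector back
# into a list of newly created Python float objects
result = [x * 1.0 for x in as_floats]
t1 = perf_counter()

print("conversion-like work: {:.4f} s".format(t1 - t0))
```

Comparing this duration with the ~180 ms gap from Edit 2 gives a sense of how much of the overhead plain per-element conversion can account for.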

Here is the set of rules Cython automatically applies when converting between C++ containers and Python objects.

Another issue: you are passing the vector array by value, so it has to be copied. Your timing of the C++ code does not include this copying, which makes the comparison a little unfair.

You should pass the vector by const-reference, i.e.

... wrapped_function(const std::vector<float> &array)

One more thing: you return a vector, which could be copied as well, and this copying time is once again not included in your C++ timings. However, all modern compilers apply return value optimization, so this is not an issue here.

ead
  • I followed your wise advice and got a 5 to 8% speedup thanks to the const reference. Is this what you had in mind? Because I find it a bit disappointing; I was expecting 20 to 30%... I also tried without the return value, and that indeed speeds up the Python call by 20%. – Romzie May 30 '18 at 14:52
  • Your question was “what explains the gap?” and not “what can I do to speed it up?”; the explanation is what this answer delivers. Either there are so many calculations in the C++ part that the Python/C++ conversion doesn't matter, or you have to avoid the conversion. – ead May 30 '18 at 15:26