Speeding up user-defined functions

Question

I have a simulation in which the end user can provide arbitrarily many functions, which then get called in the innermost loop. Something like:

class Simulation:

    def __init__(self):
        self.rates = []
        self.amount = 1

    def add(self, rate):
        self.rates.append(rate)

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

def rate(t):
    return t**2

simulation = Simulation()

simulation.add(rate)
simulation.run(100000)

Being a Python loop, this is very slow, but I can't get my usual approaches to speeding up the loop to work.

Because the functions are user-defined, I can't "numpyfy" the innermost call (i.e. rewrite it so that the innermost work is done by optimized numpy code).

I first tried numba, but numba doesn't allow passing functions into other functions, even if those functions are also numba-compiled. It can use closures, but because I don't know up front how many functions there will be, I don't think I can use that. Closing over a list of functions fails:

import numba

@numba.jit(nopython=True)
def a():
    return 1

@numba.jit(nopython=True)
def b():
    return 2

fs = [a, b]

@numba.jit(nopython=True)
def c():
    total = 0
    for f in fs:
        total += f()
    return total

c()

This fails with an error:

[...]
  File "/home/syrn/.local/lib/python3.6/site-packages/numba/types/containers.py", line 348, in is_precise
    return self.dtype.is_precise()
numba.errors.InternalError: 'NoneType' object has no attribute 'is_precise' 
[1] During: typing of intrinsic-call at <stdin> (4)

I can't find the source, but I think the numba documentation stated somewhere that this is not a bug, just something that is not expected to work.

Something like the following would probably work around calling functions from a list, but it seems like a bad idea:

def run(self, maxtime):
    rates = self.rates
    len_rates = len(rates)
    f1 = rates[0]
    if len_rates > 1:
        f2 = rates[1]
    if len_rates > 2:
        f3 = rates[2]
    #[... repeat until some arbitrary limit]
    @numba.jit(nopython=True)
    def inner(amount):
        for t in range(0, maxtime):
            amount *= f1(t)
            if len_rates > 1:
                amount *= f2(t)
            if len_rates > 2:
                amount *= f3(t)
            #[... repeat until the same arbitrary limit]
        return amount

    self.amount = inner(self.amount)

I guess it would also be possible to do some bytecode hacking: compile the functions with numba, pass a list of strings with the names of the functions into inner, do something like call(func_name), and then rewrite the bytecode so that it becomes func_name(t).

For Cython, just compiling the loop and the multiplications will probably speed things up a bit, but if the user-defined functions are still Python, just calling them will probably still be slow (although I haven't profiled that yet). I didn't really find much information on "dynamically compiling" functions with Cython, but I guess I would need to somehow add type information to the user-provided functions, which seems... hard.

Is there any good way to speed up loops with user-defined functions without needing to parse them and generate code from them?
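A lighter variant of the same idea, sketched here in pure Python, is to generate source text instead of rewriting bytecode: the loop body is unrolled into one direct call per rate function, so there is no arbitrary limit and no list iteration in the inner loop. The helper name build_runner and the local names f0, f1, ... are hypothetical, and whether numba can compile the generated function is not verified here.

```python
def build_runner(rates):
    """Generate a function whose loop body calls every rate directly.

    Hypothetical helper: each rate function is bound to a name f0, f1, ...
    in the namespace of the generated source.
    """
    names = [f"f{i}" for i in range(len(rates))]
    lines = ["def run(amount, maxtime):",
             "    for t in range(maxtime):"]
    for name in names:
        # One unrolled call per user-supplied function.
        lines.append(f"        amount *= {name}(t)")
    if not names:
        lines.append("        pass")
    lines.append("    return amount")

    namespace = dict(zip(names, rates))
    exec("\n".join(lines), namespace)
    return namespace["run"]


def double(t):
    return 2


def triple(t):
    return 3


run = build_runner([double, triple])
result = run(1, 3)  # (2 * 3) ** 3 == 216
```

On its own this only removes the inner for loop over the list; the calls themselves are still ordinary Python calls.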

@Scovetta I just measured with pypy and unfortunately it takes 5x as long as with normal Python. I did no specific pypy optimisations, although I warmed up the JIT. — syntonym, Mar 04 '18 at 13:57
1) Did you have a look at this workaround? http://numba.pydata.org/numba-doc/dev/user/faq.html 2) Is replacing the list rates with a numpy array not a possibility? — max9111, Mar 07 '18 at 14:55
@max9111 1) Putting the functions in a list and using a closure over the list throws an error for me. I think I found somewhere in some numba documentation that this is expected, but I can't find it anymore. 2) The functions can be arbitrarily complex, and there can be arbitrarily many. I don't think there is some (sane) way to encode this into a numpy array, is there? — syntonym, Mar 07 '18 at 15:29
@max9111 I added a small example using the closure technique and a list of functions and the error it throws. The complete traceback is longer, but I don't think it's super informative, and it should be easy to reproduce. — syntonym, Mar 07 '18 at 15:45

ead · Accepted Answer · 2018-03-10T09:35:14.573

I don't think you can speed up the user's function; in the end, it is the user's responsibility to write efficient code. What you can do is offer a way to interact with your program efficiently, without having to pay for overhead.

You can use Cython, and if the user is also game for using Cython, together you could achieve a speedup of around 100x compared to the pure Python solution.

As a baseline, I changed your example a little bit: the function rate does more work.

class Simulation:

    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount += rate(t)

def rate(t):
    return t*t*t+2*t

Yields:

>>> simulation=Simulation([rate])
>>> %timeit simulation.run(10**5)
43.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
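%timeit is an IPython magic; outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The repeat counts below are arbitrary choices, and the best time is reported rather than the mean ± std. dev. that %timeit prints, so the numbers are not directly comparable to the ones above.

```python
import timeit


class Simulation:
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount += rate(t)


def rate(t):
    return t * t * t + 2 * t


simulation = Simulation([rate])

# Each measurement calls run() 5 times; the measurement is repeated 3 times.
runs_per_measurement = 5
timings = timeit.repeat(lambda: simulation.run(10**5),
                        number=runs_per_measurement, repeat=3)

# Convert the fastest measurement to milliseconds per single run() call.
best_ms = min(timings) / runs_per_measurement * 1000
print(f"best of 3: {best_ms:.1f} ms per run")
```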

We can use Cython to speed things up, starting with your run function:

%%cython
cdef class Simulation:
    cdef int amount
    cdef list rates
    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

This gives us almost a factor of 2:

>>> %timeit simulation.run(10**5)
23.2 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The user could also use Cython to speed up their calculation:

%%cython
def rate(int t):
  return t*t*t+2*t

>>> %timeit simulation.run(10**5)
7.08 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using Cython has already given us a speed-up of about 6. What is the bottleneck now? We are still using Python for polymorphism/dispatch, and this is pretty costly because it requires creating Python objects (here, Python integers). Can we do better with Cython? Yes, if we define, at compile time, an interface for the functions we pass to run:

%%cython   
cdef class FunInterface:
   cpdef int calc(self, int t):
      pass

cdef class Simulation:
    cdef int amount
    cdef list rates

    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        cdef FunInterface f
        for t in range(maxtime):
            for f in self.rates:
                self.amount *= f.calc(t)

cdef class  Rate(FunInterface):
    cpdef int calc(self, int t):
        return t*t*t+2*t

This yields an additional speed-up of about 7:

>>> simulation=Simulation([Rate()])
>>> %timeit simulation.run(10**5)
1.03 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The most important part of the code above is the line:

self.amount *= f.calc(t)

which no longer needs Python for dispatch, but uses machinery quite similar to virtual functions in C++. This C++-style approach has only the very small overhead of one indirection/look-up. It also means that neither the result of the function nor its arguments have to be converted to Python objects. For this to work, calc must be a cpdef function; how inheritance works for cpdef functions has more gory details of its own.

The bottleneck now is the line for f in self.rates, because we still have to do a lot of Python interaction in every step. Here is an example of what could be possible if we could improve on this:
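To make the shape of this interface concrete without a Cython build, here is a pure-Python stand-in for the same classes. It shows only the design, not the speed-up, since every call here still goes through normal Python dispatch. CallableRate is a hypothetical adapter that is not part of the code above; it illustrates how a user with a plain function could still satisfy the interface.

```python
class FunInterface:
    """Pure-Python stand-in for the cdef class of the same name."""

    def calc(self, t):
        raise NotImplementedError


class Rate(FunInterface):
    def calc(self, t):
        return t * t * t + 2 * t


class CallableRate(FunInterface):
    """Hypothetical adapter: wraps a plain function behind calc()."""

    def __init__(self, func):
        self.func = func

    def calc(self, t):
        return self.func(t)


class Simulation:
    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(maxtime):
            for f in self.rates:
                # Every rate object is used only through the calc() interface.
                self.amount *= f.calc(t)


simulation = Simulation([CallableRate(lambda t: t + 1)])
simulation.run(4)  # 1 * 1 * 2 * 3 * 4 == 24
```

In the Cython version, an adapter like this would presumably pay the Python call overhead again for the wrapped function, so it is a convenience for users who do not write Cython rather than a fast path.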

%%cython
.....
cdef class Simulation:
    cdef int amount
    cdef FunInterface f  #just one function, no list

    def __init__(self, fun):
        self.f=fun
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            self.amount *= self.f.calc(t)

...

>>> simulation=Simulation(Rate())
>>> %timeit simulation.run(10**5)
408 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Another factor of 2, but you have to decide whether the more complicated code needed to store a list of FunInterface objects without Python interaction is really worth it.

That looks promising. I need some time to implement Cython in my codebase and measure the gains. — syntonym, Mar 04 '18 at 21:39
I implemented all but the last recommendation, and together with some other things (boundscheck false, call to numpy replaced with gsl) I got a speedup of 20x. What used to run in one hour now runs in 3 minutes. — syntonym, Mar 10 '18 at 09:27
