
I am creating a backtesting program in Python. At the moment I need a really significant speed-up. With Cython I achieved a 200x speed-up, but it's not enough: if I ran my code on all of my data, it would still take around 16 hours, and I would probably need to run it multiple times.

I have used cProfile on my code and found that this function takes around 98-99% of the total run time.
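For reference, a typical invocation for such a profile looks like this (the script name is just a placeholder):

    python -m cProfile -s cumtime run_backtest.py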

import numpy as np
cimport cython
cimport numpy as np
np.import_array()


@cython.wraparound(False)
@cython.boundscheck(False)
@cython.cdivision(True)
cdef tp_sl_back_test(np.ndarray[np.float64_t, ndim=2] data, double tp, double sl):

    cdef double balance = 100
    cdef double balance_copy
    
    cdef Py_ssize_t i

    cdef int right = 0
    cdef int total = 0

    cdef double entry_price
    cdef double close_price
    cdef double high_price
    cdef double low_price
    cdef double tp_price
    cdef double sl_price

    for i in range(data.shape[0]):

        balance_copy = balance

        entry_price = data[i, 0]
        high_price = data[i, 1]
        low_price = data[i, 2]
        close_price = data[i, 3]

        tp_price = entry_price + ((entry_price/100) * tp)
        sl_price = entry_price - ((entry_price/100) * sl)

        if (sl_price < low_price) and (tp_price > high_price):
            pass

        elif (sl_price >= low_price) and (tp_price > high_price):
            close_price = sl_price

        elif (sl_price < low_price) and (tp_price <= high_price):
            close_price = tp_price

        else:
            close_price = sl_price

        balance *= 0.9996
        balance += ((close_price - entry_price) * (balance / entry_price))
        balance *= 0.9996

        if balance_copy < balance:
            right += 1
            total += 1
        else:
            total += 1

    return balance, right, total

I am new to Cython and don't know many optimisation techniques; maybe my code cannot be optimised much further.

I have tried changing `np.ndarray[np.float64_t, ndim=2] data` to `double[:, :] data`, but it had almost no effect.

I need at least an 800x speed-up in order to achieve a satisfying result.

Any criticism is welcome.

Thanks to everyone in advance.

EDIT:

I have done some changes and the code now looks like this:

@cython.wraparound(False)
@cython.boundscheck(False)
@cython.cdivision(True)
cdef tp_sl_back_test(double[:, :] data, double tp, double sl):

    cdef double balance = 100
    cdef double balance_copy

    cdef Py_ssize_t  i

    cdef int right = 0
    cdef int total = 0

    cdef double entry_price
    cdef double close_price
    cdef double high_price
    cdef double low_price

    cdef double tp_price
    cdef double sl_price

    for i in range(data.shape[0]):

        balance_copy = balance

        entry_price = data[i, 0]
        high_price = data[i, 1]
        low_price = data[i, 2]
        close_price = data[i, 3]

        tp_price = entry_price + ((entry_price * 0.01) * tp)
        sl_price = entry_price - ((entry_price * 0.01) * sl)

        if (sl_price < low_price) and (tp_price > high_price):
            pass

        elif sl_price >= low_price:
            close_price = sl_price

        elif tp_price <= high_price:
            close_price = tp_price

        else:
            close_price = sl_price

        balance *= 0.9996
        balance += ((close_price - entry_price) * (balance / entry_price))
        balance *= 0.9996

        if balance_copy < balance:
            right += 1
            total += 1
        else:
            total += 1

    return balance, right, total

I have tried to add some optimisation flags, but I get a D9002 error saying that -O3 is an unknown flag; at the moment I'm trying to fix it.
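A minimal sketch of a setup.py that picks the flags per compiler (the module and file names here are placeholders; MSVC rejects GCC-style flags with D9002 and wants /O2 and /fp:fast instead):

    import sys
    from setuptools import setup, Extension
    from Cython.Build import cythonize

    # Assume MSVC on Windows, GCC/Clang elsewhere.
    if sys.platform == "win32":
        flags = ["/O2", "/fp:fast"]
    else:
        flags = ["-O3", "-ffast-math"]

    extensions = [Extension("backtest", ["backtest.pyx"], extra_compile_args=flags)]

    setup(ext_modules=cythonize(extensions, language_level=3))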

Still haven't tried Numba, but I will soon.

Kelly Bundy asked for some information about the data, and here it is:

  1. The data I'm testing on is 2892 rows long. But this is the data of 1 asset on 1 timeframe (the 4-hour timeframe), and I would also need to test this with different parameters.
  2. The function above is called 20,000 times inside another Cython function.
  3. The open_time, volume and color columns are not in the data array.
  4. This is an example of input data.

EDIT 2:

This is the full .pyx file: https://pastebin.com/UwEUH5EK

tp_sl_mash is just an array of tp (take profit) and sl (stop loss) combinations; it is built by finding the biggest price differences:

    high_max, low_max = find_min_max(data)

    tp_array = np.linspace(0, (high_max + 2 * (high_max / lenght_array)), lenght_array)
    sl_array = np.linspace(0, (low_max + 2 * (low_max / lenght_array)), lenght_array)

    tp_sl_mash = np.array(np.meshgrid(tp_array, sl_array)).T.reshape(-1, 2)
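As a quick illustration of the resulting shape (values made up, with lenght_array = 2, so every (tp, sl) pair appears as one row):

    >>> np.array(np.meshgrid([0., 1.], [0., 2.])).T.reshape(-1, 2)
    array([[0., 0.],
           [0., 2.],
           [1., 0.],
           [1., 2.]])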

def find_min_max(df):

    df['Percent_High'] = df['High_Price'] / df['Open_Price']
    df['Percent_Low'] = df['Low_Price'] / df['Open_Price']

    df_high_max = (df['Percent_High'].max() - 1) * 100
    df_low_max = abs((1 - df['Percent_Low'].min())) * 100

    return df_high_max, df_low_max

You can install the test data from this link https://drive.google.com/file/d/1-kuFJjRRDrEnclgIdoDvMt-dOVlPqPwi/view?usp=sharing

All the main testing file: https://pastebin.com/HVyU8xJy
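Following roganjosh's prange suggestion below, here is a sketch of how the sweep over the tp/sl combinations could be parallelized (untested; it assumes the inner function is refactored to be nogil and to return only the final balance, and all names besides the existing ones are hypothetical):

    from cython.parallel import prange
    cimport cython

    @cython.wraparound(False)
    @cython.boundscheck(False)
    @cython.cdivision(True)
    cdef double tp_sl_balance(double[:, :] data, double tp, double sl) nogil:
        # Same per-row logic as tp_sl_back_test, but GIL-free and balance-only.
        cdef double balance = 100
        cdef double entry_price, high_price, low_price, close_price
        cdef double tp_price, sl_price
        cdef Py_ssize_t i
        for i in range(data.shape[0]):
            entry_price = data[i, 0]
            high_price = data[i, 1]
            low_price = data[i, 2]
            close_price = data[i, 3]
            tp_price = entry_price * (1 + tp * 0.01)
            sl_price = entry_price * (1 - sl * 0.01)
            if sl_price >= low_price:
                close_price = sl_price
            elif tp_price <= high_price:
                close_price = tp_price
            balance *= 0.9996 * (close_price / entry_price) * 0.9996
        return balance

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def sweep(double[:, :] data, double[:, :] tp_sl_mash, double[:] out):
        # Each (tp, sl) combination is independent, so the outer loop can run
        # across threads; compile with -fopenmp (GCC/Clang) or /openmp (MSVC).
        cdef Py_ssize_t j
        for j in prange(tp_sl_mash.shape[0], nogil=True):
            out[j] = tp_sl_balance(data, tp_sl_mash[j, 0], tp_sl_mash[j, 1])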

  • Have you considered using numba https://numba.pydata.org/? It works well with numpy. When I used a single function very frequently, I was able to significantly decrease execution time. It also works well with multithreading. – Karol Adamiak Feb 09 '23 at 09:16
  • Are you using an optimisation flags when you compile the code, such as `-O3` or `-ffast-math`? – Matt Pitkin Feb 09 '23 at 09:20
  • @KarolAdamiak Cython should be faster, at least according to the information I have found. I thought about using Numba, Pythran or Cython and, in the end, decided to use Cython. – Maksym Machkovskiy Feb 09 '23 at 09:21
  • @MattPitkin Didn't know they could be used in Cython. Will research now. Thanks – Maksym Machkovskiy Feb 09 '23 at 09:22
  • I feel there might be improvements using the `apply_over_axis` function in numpy for the computation – Oluwafemi Sule Feb 09 '23 at 09:33
  • If you use `setuptools` to build your package, then adding extra compiler flags for Cython can be done as in this example https://stackoverflow.com/a/68423687/1862861. – Matt Pitkin Feb 09 '23 at 09:35
  • @OluwafemiSule do you mean using vectorized calculations? – Maksym Machkovskiy Feb 09 '23 at 09:36
  • You might squeeze a tiny bit of performance by switching `entry_price/100` to `entry_price * 0.01`. – Matt Pitkin Feb 09 '23 at 09:36
  • This code doesn't work properly because balance and balance_copy are the same value? – roganjosh Feb 09 '23 at 09:36
  • I was trying to go through the code to see whether you can use `prange` and thread the operation – roganjosh Feb 09 '23 at 09:38
  • @MattPitkin At the moment I do not have access to my PC, will try it in 6 hours. Thanks. – Maksym Machkovskiy Feb 09 '23 at 09:39
  • @roganjosh The code works: balance and balance_copy are not the same objects, so the changes applied to balance do not affect balance_copy. Multithreading probably won't work because of the GIL. – Maksym Machkovskiy Feb 09 '23 at 09:43
  • `prange` can remove the GIL. It's an explicit setting `nogil` https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html – roganjosh Feb 09 '23 at 09:45
  • Can you give us example input+output and make it clearer what the task is? – Kelly Bundy Feb 09 '23 at 09:51
  • How large is the data? – Kelly Bundy Feb 09 '23 at 09:54
  • @MaksymMachkovskiy Yes. However, I'm not certain if numpy's loops are more performant than those you have written in cython – Oluwafemi Sule Feb 09 '23 at 10:09
  • switch to `float32` if possible – dankal444 Feb 09 '23 at 11:18
  • @dankal444 Thanks for the advice, but what would it change? – Maksym Machkovskiy Feb 09 '23 at 11:21
  • @KellyBundy the data is like more than 100 GB. When I get access to my PC I will give an example of the input. – Maksym Machkovskiy Feb 09 '23 at 11:28
  • It *may* speed up your code significantly. It lowers memory consumption and data transfer and allows more of your data to fit into the CPU cache. Now that I see your data is huge, it should have a large impact on performance – dankal444 Feb 09 '23 at 11:29
  • So `data` has over 3 billion rows of 4 floats each? – Kelly Bundy Feb 09 '23 at 11:48
  • @KellyBundy not exactly, the data in itself is around 10 GB, but I need to loop over the data with different parameters; in total it would be more than 100 GB, I think around 140 GB – Maksym Machkovskiy Feb 09 '23 at 11:53
  • What's the reason you do `balance += ((close_price - entry_price) * (balance / entry_price))` instead of `balance *= close_price / entry_price`? – Kelly Bundy Feb 09 '23 at 12:00
  • @KellyBundy yeah, I think you've got a point, will test. – Maksym Machkovskiy Feb 09 '23 at 12:30
  • I mean, mathematically that's equivalent. But this looks like you're trading shares, and `balance / entry_price` looks like the number of shares you buy at entry. Is that right? If so, can you buy arbitrary fractions of shares? Or should you round that down? – Kelly Bundy Feb 09 '23 at 12:39
  • @KellyBundy I can buy fractions of shares. But for long positions it's `close_price/entry_price` and for short positions it's `entry_price/close_price` – Maksym Machkovskiy Feb 09 '23 at 13:24
  • Then it looks like you don't need loops and Cython. Those different size statements btw are confusing. The estimated 16 hours, that's for how many iterations of the loop? – Kelly Bundy Feb 09 '23 at 13:39
  • 2892 rows, 20000 times, 16 hours means 1004 rows per second. That can't be right. What am I misunderstanding? – Kelly Bundy Feb 09 '23 at 17:10
  • @KellyBundy 16 hours would be the time if I ran this function on all my data. This code executes in 0.48 seconds – Maksym Machkovskiy Feb 09 '23 at 17:13
  • So it's 120 million rows per second? And you need it 800 times faster? – Kelly Bundy Feb 09 '23 at 17:18
  • @KellyBundy I need it 800 times faster than the pure python version or 4 times faster than the cython one. – Maksym Machkovskiy Feb 09 '23 at 17:22
  • @KellyBundy Bro, I have just changed `balance += ((close_price - entry_price) * (balance / entry_price))` to `balance *= close_price / entry_price` and now the code runs in 0.24 seconds; it's a 443x total speed-up compared to the original Python version. Wasn't expecting that much from such a small change; now I will work in that direction. Thanks a lot. – Maksym Machkovskiy Feb 09 '23 at 20:17
  • That's more than I expected as well, given all the other stuff you're still doing in the loop. Maybe the compiler had trouble optimizing the original. If you provide test data as text, including tp/sl, then people could try more stuff. I see a few things ... – Kelly Bundy Feb 09 '23 at 20:36
  • There is a huge difference between the two: the ordering does not matter in the new expression. This enables the expression to be computed much more efficiently by the processor, and the compiler can use SIMD instructions and unroll the loop more aggressively (and use a reduction so as to break the dependency chain). Not to mention the two `balance *= 0.9996` can be merged. In fact, all of them can be precomputed: `balance *= pow(0.9996, 2*data.shape[0])`. Such details matter a lot. I guess the resulting loop can even be parallelized now with this. The key point is: dependency chains are bad. – Jérôme Richard Feb 09 '23 at 20:46
  • @JérômeRichard Yeah, that's what I meant earlier when I said they don't need loops and Cython. That original `balance` update was the only thing that looked like it needs "sequential" computation. I think now the whole loop could instead be done with NumPy vector operations. (Not saying that would be faster, don't have experience with Cython/Numba.) – Kelly Bundy Feb 09 '23 at 20:53
  • ("sequential" from one row to the next, I mean.) – Kelly Bundy Feb 09 '23 at 20:59
  • @JérômeRichard the two `balance *= 0.9996` can be merged, but I lose performance. The `balance *= 0.9996` is the commission rate; if I want to merge the two, the final formula would be something like `balance = (balance*0.99920016*close_price)/entry_price`. And unfortunately it needs to be sequential (at least I do not see any possible way around that). – Maksym Machkovskiy Feb 09 '23 at 21:08
  • Sequential for what? For the final `balance` or for the final `right`? – Kelly Bundy Feb 09 '23 at 21:12
  • @KellyBundy The balance variable is changed every iteration, simulating trading, so it needs to be sequential (at least as far as I know). – Maksym Machkovskiy Feb 09 '23 at 21:18
  • If you compute the `entry_price` vector and the `close_price` vector, you have `balance = 100 * np.product(entry_price / close_price)`. I don't know NumPy well, either, but I imagine it can parallelize all of that. – Kelly Bundy Feb 09 '23 at 21:22
  • @KellyBundy even if it works, I still need to calculate the `tp_price` and `sl_price`, and I still need to do the if statements. Will try now to vectorize. – Maksym Machkovskiy Feb 09 '23 at 21:32
  • @KellyBundy I have just tried this; the result changes a lot and it's slower. Honestly, I was expecting it to be faster. Anyway, thanks for your time. – Maksym Machkovskiy Feb 09 '23 at 21:41
  • (oops I forgot the 0.9996 up there, was focused on np.product) – Kelly Bundy Feb 09 '23 at 21:41
  • I think those intermediate values should be easy to vectorize. And `balance_copy < balance` is equivalent to `entry_price / close_price < 0.9996**2`. – Kelly Bundy Feb 09 '23 at 21:46
  • @KellyBundy I think you are misunderstanding how this code works: `balance_copy < balance` is not equivalent to `entry_price / close_price < 0.9996**2` – Maksym Machkovskiy Feb 09 '23 at 21:52
  • How so? `balance_copy` is the value before the multiplication with `0.9996 * close_price / entry_price * 0.9996`, and `balance` is the value after. You just test whether `balance` got bigger. That's the case iff `0.9996 * close_price / entry_price * 0.9996 > 1`, no? – Kelly Bundy Feb 09 '23 at 21:57
  • @KellyBundy You're right, `balance_copy < balance` is equivalent to `entry_price / close_price < 0.9996**2`. I'm feeling so dumb right now. Even in that case it's still slower; the performance dropped to around 240x from 450x – Maksym Machkovskiy Feb 09 '23 at 22:18
  • Note that `-O3` does not work, certainly because you are using MSVC on Windows, I guess. MSVC requires `/O2`, which should already be used by default. Also note that by default compilers do not use wide SIMD units (AVX/AVX-2), for backward compatibility. Enabling them may speed up the code. That being said, MSVC tends not to generate fast code anyway. I advise you to try other compilers like GCC or Clang. – Jérôme Richard Feb 09 '23 at 22:18
  • Could you pastebin that function code (just comment the link)? – Kelly Bundy Feb 09 '23 at 22:20
  • @JérômeRichard I honestly didn't understand like 90% of your comment. Anyway, I have installed GCC via MinGW64. – Maksym Machkovskiy Feb 09 '23 at 22:22
  • You can install the test data from this link https://drive.google.com/file/d/1-kuFJjRRDrEnclgIdoDvMt-dOVlPqPwi/view?usp=sharing All the main testing file: https://pastebin.com/HVyU8xJy This is the full .pyx file: https://pastebin.com/UwEUH5EK – Maksym Machkovskiy Feb 09 '23 at 22:23
  • As for the double product being faster, this does not make any sense to me. If this happens, it means something really went wrong, and understanding why may help to speed the program up. I assumed you enabled fast-math, but it might not be the case (try `/fp:fast` for MSVC and `-ffast-math` otherwise). Fast-math will impact results, but if they are too different, then this likely means your current algorithm generates garbage because of numerical instability (which is frequent when there is a long dependency chain, since the error accumulates and may become exponential with a product) – Jérôme Richard Feb 09 '23 at 22:24
  • @JérômeRichard will do it tomorrow; unfortunately I am obliged to leave you guys (it's almost midnight in Europe). Thanks to you, JérômeRichard, and thanks to Kelly Bundy. I am grateful for your help and time. – Maksym Machkovskiy Feb 09 '23 at 22:37
  • @JérômeRichard Isn't product pretty stable? Can you show an example with such an exponential error? I just [tried](https://ato.pxeger.com/run?1=m72soLIkIz9vwYKlpSVpuhY3j2XmFuQXlSjkleYWVCokFivkFXClFeXnKuQmlmQoQCULivJTuKBskDgXV6KCLVClXlFiXkp-LpTSMLI0MNBU0FYw0DM15Sooyswr0QCqAWkuTS7RSNTUhAqCjAALYxUrBtqSCpbCLaejUJRallpUnGobUlSaClQJ8Q3UUzDPAQA) product of 2900 random numbers, and random order, sorted, and reverse sorted are all close together. – Kelly Bundy Feb 09 '23 at 22:38
  • @KellyBundy It looks stable indeed, but the error can come from many places and I learned to be careful about numerical stability ;) . For example, if the loop is basically `balance *= close_price / entry_price * 0.9996` and `entry_price/close_price` tends to be very close to 0.9996, then `balance` is basically multiplied by `1+ε`. Thus, a simple optimization like computing `close_price * (1 / entry_price)` with an inaccurate reciprocal can drastically impact the result (`(1+ε)^n-(1-ε)^n` is not negligible for a big `n` if `ε` is not very small). – Jérôme Richard Feb 09 '23 at 23:05
  • @KellyBundy In practice, it should be rather fine thanks to double-precision computation and `n` not being so huge AFAIK. It is also a pathological use-case, and the OP may not care much about high accuracy (as opposed to physicists), though the conditional `balance_copy < balance` might make this a bit more critical. It looks like this is the only problem of numerical instability at first glance for the last version. – Jérôme Richard Feb 09 '23 at 23:08
  • @JérômeRichard Right, but that ε *is* very small. And their n is only around 2900, and even with n=10⁸ it's still [very close to 1](https://tio.run/##K6gsycjPM7YoKPr/P60oP1chN7EkQyEztyC/qEShoCg/hYurQsFWwVDfQM/S0lRBSwFMcxUUZeaVaFRoQhkghRrRFbFAeSNLAwNNLOKGBlpaFpqa//8DAA). – Kelly Bundy Feb 09 '23 at 23:16
  • @KellyBundy I wonder if `prod` uses a naive loop or a pair-wise product having better numerical stability, but I guess it is the first. In your case, `ε` is I guess 1 ULP, so it is indeed very small, but processors have an inaccurate reciprocal instruction. IDK if compilers can use it on double-precision numbers with fast-math (I really hope not), but I have already seen Numba use it for single-precision with fast-math. On x86, the maximum *relative* error for this approximation is less than 1.5*2^-12 ~= 0.000366, which is not so small compared to 1 DP ULP, which is <1e-15 ;). – Jérôme Richard Feb 09 '23 at 23:37
  • @JérômeRichard So I guess we can say it does grow exponentially, but the base is very close to 1, and so the exponential growth is still very small for a very long time. – Kelly Bundy Feb 09 '23 at 23:37
  • @JérômeRichard `math.prod` is a naive loop. – Kelly Bundy Feb 09 '23 at 23:39
  • @JérômeRichard 0.000366 instead of <1e-15? If that's true, that sounds like utter garbage that nobody would use... – Kelly Bundy Feb 09 '23 at 23:42
  • @KellyBundy Well, I thought so too, but it is actually useful in some cases where performance is really critical and you are sure your algorithm is (very) numerically stable. This is not so rare in video games, where performance matters more than having accurate results: they just need to look realistic ;) . I think it can be useful for image and sound processing too, but probably not much more. – Jérôme Richard Feb 10 '23 at 00:01
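For completeness, a minimal sketch (untested against the real data) of the fully vectorized version Kelly Bundy outlines above; the OP reports it came out slower than the Cython loop on their machine. The two commission legs fold into a single per-row factor of 0.9996**2 * close/entry:

    import numpy as np

    def tp_sl_back_test_vec(data, tp, sl):
        entry = data[:, 0]
        high = data[:, 1]
        low = data[:, 2]
        close = data[:, 3]

        tp_price = entry * (1.0 + tp * 0.01)
        sl_price = entry * (1.0 - sl * 0.01)

        # Same branch logic as the loop: the stop-loss wins whenever it is hit,
        # then the take-profit, otherwise the raw close price.
        close_eff = np.where(sl_price >= low, sl_price,
                             np.where(tp_price <= high, tp_price, close))

        factor = 0.9996 * (close_eff / entry) * 0.9996  # per-row balance factor
        balance = 100.0 * np.prod(factor)
        right = int(np.count_nonzero(factor > 1.0))     # balance grew on this row
        return balance, right, data.shape[0]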

1 Answer


Have you tried Numba? I don't have actual test data, but the following should be fast. It uses multithreading with fastmath.

import numpy as np
from numba import njit, float64

@njit(fastmath=True)
def tp_sl_back_test(data: np.ndarray, tp: float64, sl: float64):
    balance = 100
    balance_copy = 0
    
    right = total = 0
    entry_price = close_price = high_price = low_price = tp_price = sl_price = 0

    for i in range(data.shape[0]):
        balance_copy = balance

        entry_price = data[i, 0]
        high_price = data[i, 1]
        low_price = data[i, 2]
        close_price = data[i, 3]

        tp_price = entry_price + (entry_price/100) * tp
        sl_price = entry_price - (entry_price/100) * sl

        if sl_price < low_price and tp_price > high_price:
            pass
        elif sl_price >= low_price and tp_price > high_price:
            close_price = sl_price
        elif sl_price < low_price and tp_price <= high_price:
            close_price = tp_price
        else:
            close_price = sl_price

        balance *= 0.9996
        balance += (close_price - entry_price) * balance / entry_price
        balance *= 0.9996

        if balance_copy < balance:
            right += 1
            total += 1
        else:
            total += 1
    return balance, right, total
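A quick usage note: the first call includes the JIT compilation time, so benchmark the second call (the data and tp/sl values below are random placeholders, just for shape):

    data = np.random.rand(2892, 4)
    tp_sl_back_test(data, 1.5, 1.0)                           # warm-up: triggers compilation
    balance, right, total = tp_sl_back_test(data, 1.5, 1.0)   # representative timing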
AboAmmar
  • There's no reason that numba should be faster than cython – roganjosh Feb 09 '23 at 09:33
  • Umm.. let's see if the OP can test and report back. I tried Numba in the past and it gave me good speedups. – AboAmmar Feb 09 '23 at 09:36
  • @AboAmmar thanks for this code. At the moment I do not have access to my PC, will try it in 6 hours. – Maksym Machkovskiy Feb 09 '23 at 09:41
  • "*It uses multithreading*" No, it does *not*. I do not know where this common belief comes from, but **Numba does not automagically parallelize sequential code**. The dependency chain on `balance` prevents any automatic parallelization (and I am not sure a manual parallelisation is possible either). Besides, using fastmath is also dangerous because the code is sensitive to numerical instability, but the OP's focus is clearly performance, so fastmath may be OK there. – Jérôme Richard Feb 09 '23 at 10:51
  • @JérômeRichard Thanks, will keep it in mind. – Maksym Machkovskiy Feb 09 '23 at 11:23