
I'm using python and apparently the slowest part of my program is doing simple additions on float variables.

It takes about 35 seconds to do around 400,000,000 additions/multiplications.

I'm trying to figure out what is the fastest way I can do this math.

This is what the structure of my code looks like. Example (dummy) code:

```python
def func(x, y, z):
    loop_count = 30
    a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]            # 35 elements
    b = [0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 1010, 1111, 1212, ...]  # 35 elements
    p = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]               # 35 elements
    for i in range(loop_count - 1):
        c = p[i - 1]
        d = a[i] + c * a[i + 1]
        e = min(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        # ... and so on
        p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
    return sum(p)
```

`func()` is called about 200k times, and `loop_count` is about 30. Each call has ~20 multiplications, ~45 additions, and ~10 uses of min/max.

I was wondering if there is a method for me to declare all these as `ctypes.c_float` and do the addition in C using stdlib or something similar?
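For context, this is a minimal sketch of what `ctypes.c_float` arithmetic looks like (the values here are arbitrary): `c_float` does not support arithmetic operators directly, so each operation still has to unwrap the value back into a Python float.

```python
from ctypes import c_float

# c_float wraps a single-precision C float, but it does not define
# arithmetic operators -- the value must be unwrapped with .value,
# which yields an ordinary Python float again, so every operation
# still goes through the interpreter.
a = c_float(1.5)
b = c_float(2.25)
total = a.value + b.value   # plain Python float addition on the unwrapped values
print(total)  # 3.75
```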

Note that the `p[i]` calculated at the end of the loop is used as `c` in the next loop iteration. For iteration 0, it just uses `p[-1]`, which is 0 in this case.

My constraints:

  • I need to use python. While I understand plain math would be faster in C/Java/etc., I cannot switch languages, because this same program does a bunch of other things in Python that cannot be done in C.
  • I tried writing this with cython, but it caused a bunch of issues with the environment I need to run this in. So, again - not an option.
AbdealiLoKo
  • numpy is not about parallelization. Are you sure you understand your constraints? – Stephen Rauch Sep 18 '18 at 03:45
  • Note that your code will produce an `IndexError` in the first iteration of the `for` loop because of `p[i-1]` with `i` of `0` value. – blhsing Sep 18 '18 at 03:47
  • @blhsing It would use the last index of p - which in the first iteration is 0. And in the second iteration is the p[0] which is set in the last line – AbdealiLoKo Sep 18 '18 at 03:48
  • @AbdealiJK Ah indeed. Had a brain malfunction moment. – blhsing Sep 18 '18 at 03:49
  • @StephenRauch - removed it to see what answers come about; what I meant to say was that this cannot be modified to become vector math. But maybe my understanding of numpy's capabilities is limited and I am not able to see an easy way to use it. – AbdealiLoKo Sep 18 '18 at 03:50
  • Really, `numpy` is the way of doing this, IMO (unless you want to use C). – AGN Gazer Sep 18 '18 at 03:51
  • @AGNGazer could you elaborate on how I could use numpy here to speed it up? -- I must be missing something – AbdealiLoKo Sep 18 '18 at 03:52
  • ... or maybe not :) However, may I ask if `a`, `b`, and `c` can be pulled out of the "big" loop (i.e., declared as global constants)? Also, your inner loop, as written starts with the last element of `p`. Is this intended? It will also crash when `i` >= 12... – AGN Gazer Sep 18 '18 at 04:02
  • Actually `a`, `b` are generated based on `x` and `y` - but moving it around doesn't help much, as calculating `a`, `b` takes <2% of the execution time. Yes, the first element is the last value of `p` -> hence at the end of the loop I calculate a value which is used as the first value in the next loop – AbdealiLoKo Sep 18 '18 at 04:04
  • @AGNGazer fixed the >= 12 issue ... The lists are big enough to handle it – AbdealiLoKo Sep 18 '18 at 04:14
  • It is difficult to provide a generic answer, except that most likely there is nothing you can do better in "pure" Python to speed up floating point computations. What you need to look at are ways to streamline your computations and try to vectorize them. For example, assuming the only implicit branching you have is through `min()` in your code, I would get rid of it in the loop by moving it before the loop starts as: `a2 = np.minimum(a, 2)`. Now, without this branching, can you compute an analytical form of your inner loop computations? Based on what I see in this "dummy" example, I would say yes – AGN Gazer Sep 18 '18 at 04:25
  • It is not clear, whether it is possible to write an efficient numpy-version of your code. However once your data is in numpy arrays and not python-lists you have cython, numba and self-made C-extensions at your disposal. There is potential for speed-up factor 100 in that. Nothing you can do in pure python will come anywhere near it. – ead Sep 18 '18 at 06:34
  • @ead Numpy, numba, and all the others tend to work only if I can write my data in a vectorized form. In the way it currently is, numpy makes it slower. – AbdealiLoKo Sep 18 '18 at 07:48
  • @AGNGazer tried making a closed-form expression. Wasn't able to move everything without branching – AbdealiLoKo Sep 18 '18 at 07:48
  • @AbdealiJK you are mistaken here. Numba&Cython will speed it up, because it produces a way less overhead per addition than Python. See for example https://stackoverflow.com/a/46723823/5769463 – ead Sep 18 '18 at 07:53
  • Can't you replace `a[i]` by `i` and make `b` and `p` globals? – Sep 18 '18 at 07:57
  • @ead Cython definitely did improve speed. It made it 4-5 times as fast by just adding types for the variables. But I have issues with deployment with cython (constraint 2). Numba/Numpy doesn't seem to add much value though. – AbdealiLoKo Sep 18 '18 at 08:09
  • @YvesDaoust - this is just example code to explain the structure of the code I'm dealing with. – AbdealiLoKo Sep 18 '18 at 08:09
  • That doesn't really answer my question. –  Sep 18 '18 at 08:41
  • No, I can't replace `a[i]` by `i`, as `a` is not really 1,2,3,4... it is a bunch of values that are calculated from `x,y,z`. And no, I can't make them global, as they are in fact dependent on `x,y,z` -> https://stackoverflow.com/questions/52378578/fastest-way-to-add-multiply-two-floating-point-scalar-numbers-in-python?noredirect=1#comment91701877_52378578 .... Irrespective, the time to calculate `a,b` is less than 2% of the overall time, so I'm not too worried about them – AbdealiLoKo Sep 18 '18 at 09:02
  • Could you provide a small complete runnable example so we can play with it? – Joe Sep 18 '18 at 09:14
  • have you tried to save `a[i]` into a variable, say, `ai` so that the array is not accessed multiple times for the same index and then replace `min(2, a[i])` with `2 if 2 < ai else ai`? The use of `min/max` may incur certain [overhead](https://stackoverflow.com/a/3672613/5351549)... – ewcz Sep 18 '18 at 11:43
  • @ewcz - yes, tried IF/ELSE instead of min/max - there was not much speed improvement – AbdealiLoKo Sep 19 '18 at 06:05
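The suggestion in the last comment can be sketched as follows (the values in `a` here are made up for illustration): cache the indexed element in a local variable and replace the `min()` builtin call with an inline conditional expression.

```python
# Sketch of the suggestion above: read a[i] once into a local variable
# and replace the min() builtin call with an inline conditional expression,
# avoiding both a function call and a second indexed access.
a = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

def with_min(i):
    return min(2, a[i])

def with_conditional(i):
    ai = a[i]                  # single indexed access, kept in a local
    return 2 if 2 < ai else ai

# Both forms compute the same result for every index.
assert all(with_min(i) == with_conditional(i) for i in range(len(a)))
```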

1 Answer


I think you should consider using numpy. You did not mention any constraints.

Example case of a simple dot product (x·y):

```python
import datetime
import numpy as np

# Build the input vectors as plain Python lists of floats
# (a range object is not mutable, so use list comprehensions).
x = [i * 0.00001 for i in range(0, 10000000)]
y = [i * 0.00001 for i in range(0, 20000000, 2)]

# Hand-written dot product in pure Python.
now = datetime.datetime.now()
z = 0.0
for i in range(len(x)):
    z = z + x[i] * y[i]
print("handmade dot =", datetime.datetime.now() - now)
print(z)

# The same computation with numpy's compiled dot product.
x = np.arange(0.0, 10000000.0 * 0.00001, 0.00001)
y = np.arange(0.0, 10000000.0 * 0.00002, 0.00002)

now = datetime.datetime.now()
z = np.dot(x, y)
print("numpy dot =", datetime.datetime.now() - now)
print(z)
```

outputs:

```
handmade dot = 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
```

numpy is more than 100x faster.

The reason is that numpy wraps a C library that performs the dot product in compiled code. In pure Python you are iterating over a list of potentially generic objects, with type checks, casting, and so on.

PilouPili
  • While I completely agree with that, I do not have a dot operation here. In the current method, all I can do is simple + and - with np.float -> which was 3 times slower than doing it with normal python floats. PS: I did have a constraint that numpy/numba won't work as I can't vectorize this math, but @StephenRauch's first comment made me revert that, because I assumed maybe I missed something – AbdealiLoKo Sep 19 '18 at 18:26
  • I realize that you don't have a dot operation, but I didn't have a simpler example. And you are not saying everything, because judging from the code you posted I don't see why it can't be vectorized. Post the code you think can't be vectorized and we'll discuss it then. Sounds good? – PilouPili Sep 19 '18 at 19:25
  • In the above example, the `p[i]` is being calculated in the last line of the For loop. And `p[i-1]` is being used in the first line of the For loop. And all the calculations internally do not have a closed form expression as it has mins and max. In such a case, I can never compute the next for-loop because p[i-1] needs the previous for loop to execute. Hence IMO not vectorizable – AbdealiLoKo Sep 20 '18 at 17:03
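For illustration, the sequential dependency described in that last comment can be sketched like this (a simplified, hypothetical recurrence, not the asker's real formula): every `p[i]` needs `p[i-1]`, so the iterations cannot be evaluated independently, which is what blocks a straightforward numpy vectorization.

```python
# Simplified, hypothetical recurrence with the same dependency structure:
# p[i] is computed from p[i-1], so iteration i must wait for iteration i-1,
# and the min() introduces a branch that blocks a simple closed form.
def recurrence(x, n=5):
    p = [0.0] * n
    for i in range(n):
        c = p[i - 1]                    # p[-1] == 0.0 on the first pass
        p[i] = min(2.0, x + c) + c * x
    return p

print(recurrence(0.5))  # [0.5, 1.25, 2.375, 3.1875, 3.59375]
```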