
I have two pieces of code that calculate a function. One is based on Python for loops, as below:

import time
import numpy as np
import numba as nb

@nb.autojit()
def ODEfunction(r):
    tic = time.process_time()
    NP = 10
    s = 1
    l = 100
    f = np.zeros(len(r))
    lbound = -4 * (12 * s**12 / (-0.5 * l - r[0])**13 - 6 * s**6 / (-0.5 * l - r[0])**7)
    rbound = -4 * (12 * s**12 / (0.5 * l - r[NP - 1])**13 - 6 * s**6 / (0.5 * l - r[NP - 1])**7)

    f[0:NP] = r[NP:2 * NP]

    for i in range(NP):
        fi = 0.0
        for j in range(NP):
            if j != i:
                fij = -4 * (12 * s**12 / (r[j] - r[i])**13 - 6 * s**6 / (r[j] - r[i])**7)
                fi = fi + fij
        f[i + NP] = fi

    f[NP] = f[NP] + lbound
    f[2 * NP - 1] = f[2 * NP - 1] + rbound

    toc = time.process_time()
    print(toc - tic)
    return f

The other is the equivalent vectorized version of this code:

@nb.autojit()
def ODEfunction(r):
    tic = time.process_time()
    NP = 10
    s = 1
    l = 100
    f = np.zeros(len(r))
    lbound = -4 * (12 * s**12 / (-0.5 * l - r[0])**13 - 6 * s**6 / (-0.5 * l - r[0])**7)
    rbound = -4 * (12 * s**12 / (0.5 * l - r[NP - 1])**13 - 6 * s**6 / (0.5 * l - r[NP - 1])**7)

    f[0:NP] = r[NP:2 * NP]

    ri = r[0:NP]
    rj = r[0:NP]
    rij = np.subtract.outer(rj, ri)
    fij = -4 * (12 * s**12 / rij**13 - 6 * s**6 / rij**7)
    fij[np.diag_indices(NP)] = 0
    f[NP:2 * NP] = fij.sum(axis=0)

    f[NP] = f[NP] + lbound
    f[2 * NP - 1] = f[2 * NP - 1] + rbound

    toc = time.process_time()
    print(toc - tic)
    return f

In both, the input r is a 1-by-20 NumPy array, and as you can see I am using Numba to accelerate the code. Surprisingly, in this case the vectorized code is 5 times slower than the for loops. I have seen a similar problem before in some posts, like here: Numpy: Single loop vectorized code slow compared to two loop iteration

However, the problem there is the huge size of the arrays; as you can see, in my case there is no big array involved. Does anybody know the reason, and how it might be solved?
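To check that the two variants really compute the same thing, here is a minimal pure-NumPy sketch (Numba removed so it runs anywhere; the function names `forces_loop`/`forces_vec` and the random test input are illustrative, not from the original code):

```python
import numpy as np

NP, s, l = 10, 1.0, 100.0

def _bounds(r, f):
    # Wall forces on the first and last particle (lbound/rbound in the original).
    f[NP] += -4 * (12 * s**12 / (-0.5 * l - r[0])**13 - 6 * s**6 / (-0.5 * l - r[0])**7)
    f[2 * NP - 1] += -4 * (12 * s**12 / (0.5 * l - r[NP - 1])**13 - 6 * s**6 / (0.5 * l - r[NP - 1])**7)

def forces_loop(r):
    f = np.zeros(len(r))
    f[0:NP] = r[NP:2 * NP]            # velocities become the position derivatives
    for i in range(NP):
        fi = 0.0
        for j in range(NP):
            if j != i:
                fi += -4 * (12 * s**12 / (r[j] - r[i])**13 - 6 * s**6 / (r[j] - r[i])**7)
        f[i + NP] = fi
    _bounds(r, f)
    return f

def forces_vec(r):
    f = np.zeros(len(r))
    f[0:NP] = r[NP:2 * NP]
    rij = np.subtract.outer(r[0:NP], r[0:NP])       # rij[a, b] = r[a] - r[b]
    with np.errstate(divide='ignore', invalid='ignore'):
        fij = -4 * (12 * s**12 / rij**13 - 6 * s**6 / rij**7)
    fij[np.diag_indices(NP)] = 0      # zero out the i == j (self-interaction) terms
    f[NP:2 * NP] = fij.sum(axis=0)
    _bounds(r, f)
    return f

rng = np.random.default_rng(0)
r = np.concatenate([np.sort(rng.uniform(-40.0, 40.0, NP)),   # positions inside the box
                    rng.uniform(-100.0, 100.0, NP)])          # velocities
assert np.allclose(forces_loop(r), forces_vec(r))
```

With the timing code stripped out like this, the two variants can also be compared fairly with the `timeit` module instead of a single `process_time` measurement.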

Arash
  • Why do people writing numerical calculations always pack their code like white space was unaffordable? – Reblochon Masque May 24 '18 at 06:46
  • Your measurements are useless... Remove the `time.process_time` calculation from there and use the `timeit` module to run your function multiple times. – Giacomo Alzetta May 24 '18 at 06:54
  • This function is part of a bigger code, so it runs several times when I run the whole thing. Anyway, the difference is so evident that I don't even need these measurements; I can see it with the stopwatch on my smartphone. – Arash May 24 '18 at 06:57
  • Are you using `numba`? If so, that should be clear in the tags and text. `numba` can make iterative code fast; the usual 'rules' about vectorization don't apply. – hpaulj May 24 '18 at 07:21
  • This is absolutely expected. Another example: https://stackoverflow.com/a/50387858/4045774 . The vectorized version has a function call which Numba does not support (np.subtract.outer). Even if it did support it, every vectorized command is represented with (nested) loops, which are much harder to optimize for a compiler than your much simpler first version. Actually, an optimal compiler would convert your second version into your first, but this is really complicated to do automatically. Apart from that, the first version can also be further optimized... – max9111 May 24 '18 at 07:48
  • @max9111 Thank you for the reply; do you know how the first version can be optimized? – Arash May 24 '18 at 07:53
  • @ReblochonMasque Because sending numerical code with many white spaces to one's collaborators takes forever? https://youtu.be/gsNaR6FRuO0 – AGN Gazer May 24 '18 at 07:55
  • @Arash It is important to provide test data for your code, because some optimization techniques like parallelization depend on the size of the problem. Also add some information on the precision you need, e.g. is float32 enough, or do you need double precision (float64)? – max9111 May 24 '18 at 07:58
  • @max9111 I don't know what exactly you mean by test data. Do you mean some example of the input array? This function is called several times in another loop, which itself runs millions of times (roughly 100 million); however, float32 precision is enough for me. – Arash May 24 '18 at 08:13
  • @max9111 The whole code is a simulation of the dynamics of argon molecules in a gas, and this function is the most time-consuming part: it calculates the forces on all molecules in each time step. The simulation should run for some tens of millions of time steps. – Arash May 24 '18 at 08:22
  • Yes, I meant input arrays. The variable t is also unused in your code. Working on an answer... s is a global variable, and l is also missing. – max9111 May 24 '18 at 08:24
  • OK, I will correct them: s=1, and t is unused in this case, as you said. – Arash May 24 '18 at 08:29
  • @max9111 I corrected those points, and also r is 1 by 20, not 1 by 10. I don't have an example of the r array on this computer now, but its elements can be any real number. The first 10 elements are always in the range -0.5*l to 0.5*l; the second ten elements are not confined, but are usually within the range -100000 to 100000. – Arash May 24 '18 at 08:51
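On the follow-up question of how the loop version could be optimized further: one standard trick (sketched here in plain NumPy without Numba; the function names are illustrative and this is not from the original thread) exploits the antisymmetry of the pair force. Since both exponents are odd, the force of j on i is the negative of the force of i on j, so the inner loop only needs to run over j > i, halving the pair computations:

```python
import numpy as np

NP, s, l = 10, 1.0, 100.0

def forces_half(r):
    # Same physics as the original double loop, visiting each pair once.
    f = np.zeros(len(r))
    f[0:NP] = r[NP:2 * NP]
    for i in range(NP):
        for j in range(i + 1, NP):
            d = r[j] - r[i]
            fij = -4 * (12 * s**12 / d**13 - 6 * s**6 / d**7)
            f[i + NP] += fij
            f[j + NP] -= fij                # Newton's third law: f_ji = -f_ij
    f[NP] += -4 * (12 * s**12 / (-0.5 * l - r[0])**13 - 6 * s**6 / (-0.5 * l - r[0])**7)
    f[2 * NP - 1] += -4 * (12 * s**12 / (0.5 * l - r[NP - 1])**13 - 6 * s**6 / (0.5 * l - r[NP - 1])**7)
    return f

def forces_full(r):
    # Reference: the original full double loop, for checking correctness.
    f = np.zeros(len(r))
    f[0:NP] = r[NP:2 * NP]
    for i in range(NP):
        fi = 0.0
        for j in range(NP):
            if j != i:
                fi += -4 * (12 * s**12 / (r[j] - r[i])**13 - 6 * s**6 / (r[j] - r[i])**7)
        f[i + NP] = fi
    f[NP] += -4 * (12 * s**12 / (-0.5 * l - r[0])**13 - 6 * s**6 / (-0.5 * l - r[0])**7)
    f[2 * NP - 1] += -4 * (12 * s**12 / (0.5 * l - r[NP - 1])**13 - 6 * s**6 / (0.5 * l - r[NP - 1])**7)
    return f

rng = np.random.default_rng(1)
r = np.concatenate([np.sort(rng.uniform(-40.0, 40.0, NP)),
                    rng.uniform(-100.0, 100.0, NP)])
assert np.allclose(forces_half(r), forces_full(r))
```

The halved loop still compiles well under Numba because it is plain scalar arithmetic, which is exactly the kind of code the comments above say Numba optimizes best.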
