
I'm currently playing in Python with Runge-Kutta methods for the numerical integration of systems of differential equations, and the goal is (as the title says) the simulation of planetary orbits.

I'm investigating (comparing) the different ways to accelerate the calculations; so far I've tried a C module, which is quite efficient, and now I want to try with NumPy.

In this calculation, I need to compute the mutual attraction for each pair of planets. Currently, I'm doing this:

import numpy as np

def grav(px, py, M, ax, ay):
    G = 6.67408*10**-2     # m³/s²T
    for b in range(1, len(px)):
        # computing the distance between body #b and all previous
        dx = px[b] - px[:b]
        dy = py[b] - py[:b]
        d2 = dx*dx+dy*dy

        # computing acceleration undergone by b from all previous
        ax[b] = -sum(M[:b]*G * dx * d2**(-1.5))
        ay[b] = -sum(M[:b]*G * dy * d2**(-1.5))

        # adding for each previous, acceleration undergone by from b
        ax[:b] += M[b]*G * dx * d2**(-1.5)
        ay[:b] += M[b]*G * dy * d2**(-1.5)


# input data
system_px = np.array([1., 2., 3., 4., 5., 6., 7., 9., 4., 0.])
system_py = np.array([3., 5., 1., 2., 4., 5., 6., 3., 5., 8.])
system_M  = np.array([3., 5., 1., 2., 4., 5., 6., 3., 5., 8.])

# output array
system_ax = np.zeros(len(system_px))
system_ay = np.zeros(len(system_px))

grav(system_px, system_py, system_M, system_ax, system_ay)

for i in range(len(system_px)):
    print('body {} mass = {}(ton), position = {}(m), '
          'acceleration = ({:8.4f}, {:8.4f})(m/s²)'.format(i, system_M[i], 
                (system_px[i], system_py[i]), system_ax[i], system_ay[i]))

I wondered if there would be some very general, more «numpythonic» way to do this, one which could apply to every subset of n lines.

Camion
  • I'd suggest writing a minimal example to test how this code works, better if that is wrapped within a function, so it's easier to grasp input/output. – norok2 May 11 '19 at 10:40
  • @norok2: It's done + added some comments – Camion May 11 '19 at 12:44
  • You are close. Just rewrite it as a function and you'll be good to go. – norok2 May 11 '19 at 17:55
  • I don't think it will add much but if it can ease your mind... – Camion May 11 '19 at 22:17
  • I am sorry I may have misled you, what I meant is that you should rewrite in a form where the input/output of the functionality you want to implement are matched by the input/output of a corresponding function. Right now your output is given as an input. This rewrite has several advantages, but the most important for you to get the help you are asking is that it abstracts away the context of the problem so that people without knowledge of mechanics can still answer your question. – norok2 May 11 '19 at 23:43
  • Regarding your problem specifically, you may want to look into numpy broadcasting. With that you can vectorize all operations quite easily, as soon as you notice that you can compute all pair-wise distances in a single go (but you'll consume `n (n + 1) / 2` more memory) and hence you do not need to sum up the accelerations in two steps. The code would look (for 1D) close to: `dx = x[:, None] - x[None, :]` `a = np.sum(G * M[:, None] * dx, axis=0)` (see the runnable sketch right after these comments). – norok2 May 11 '19 at 23:57
  • Ok, it works, but it does twice the operations (I suppose it doesn't reduce the work after detecting that the resulting matrix will be antisymmetric, or does it?). Even while computing every result twice, it's much faster than my Python loop. However, I had to find out how to get rid of the division by zero when a line is matched with itself. – Camion May 12 '19 at 02:49
  • On how many points do you want to apply this algorithm? This brute-force algorithm usually does not make much sense on more than a few thousand points (small point masses don't interact significantly with other small point masses far away). – max9111 May 16 '19 at 15:32
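
Below is a minimal runnable sketch of the broadcasting idea from the comment above, in 1D and with made-up data; the zeroing of the diagonal distances is a guard against the division by zero mentioned there, not something from the comment itself:

import numpy as np

G = 6.67408e-2

x = np.array([1., 2., 4., 8.])
M = np.array([3., 5., 1., 2.])

# all pairwise differences at once: dx[i, j] = x[i] - x[j] (n x n memory)
dx = x[:, None] - x[None, :]
d2 = dx * dx
d2[d2 == 0] = 1            # avoid division by zero on the diagonal (i == j)

# acceleration of body j from every body i, summed over i
a = (G * M[:, None] * dx * d2**-1.5).sum(axis=0)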

3 Answers


With the information I got thanks to @norok2, I was able to get a much faster solution without the loop, and to partially (i.e. only for n=2) answer the question, but not both at the same time. The solution that answers the question is about 10 times slower:

import numpy as np

def grav_fast(p, M):
    G = 6.67408*10**-2     # m³/s²T
    d = p[:, :, None] - p[:, None, :]
    d2 = (d*d).sum(axis=0)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=1)
#or return (-M[None, None, :]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=2)
#   (both are equivalent because d is antisymmetric)

def grav_reply(p, M):
    G = 6.67408*10**-2     # m³/s²T
    d = np.tril(p[:, :, None] - p[:, None, :], -1)
    d2 = np.tril((d*d).sum(axis=0), -1)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=1) - \
           (M[None, None, :]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=2)

# input data
system_p = np.array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  9.,  4.,  0.],
                     [ 3.,  5.,  1.,  2.,  4.,  5.,  6.,  3.,  5.,  8.]])
system_M = np.array([3., 5., 1., 2., 4., 5., 6., 3., 5., 8.])

# output array
l = len(system_p[0])
system_a = np.zeros(shape=(2, l))

for test in 'grav_fast', 'grav_reply':
    print('\ntesting '+test)
    system_a = eval(test+'(system_p, system_M)')

    for i in range(l):
        print('body {} mass = {}(ton), position = {}(m), '
              'acceleration = [{:8.4f} {:8.4f}](m/s²)'.format(i,
                  system_M[i], system_p[:, i], system_a[0, i], system_a[1, i]))

grav_fast doesn't really answer the question, because it does twice the calculations and also computes them for a body attracted by itself (which causes a division by zero); still, for a small system it's much faster than the Python loop (break-even is around 600 bodies). On the other hand, grav_reply could be efficient if np.tril were designed to skip the unneeded calculations, but that doesn't seem to be the case: a quick test in IPython showed that changing the limit diagonal in np.tril (or np.triu) didn't noticeably change the execution time.

In [1]: import numpy as np

In [2]: import random

In [3]: a = np.array([[random.randint(10, 99) 
  ....:     for _ in range(5)] 
  ....:     for _ in range(5)])

In [4]: %timeit np.dot(a, a)
1000000 loops, best of 3: 1.35 µs per loop

In [5]: %timeit np.tril(np.dot(a, a), 0)
100000 loops, best of 3: 17.3 µs per loop

In [6]: %timeit np.tril(np.dot(a, a), -2)
100000 loops, best of 3: 16.5 µs per loop

In [7]: a = np.array([[random.randint(10, 99) 
  ....:     for _ in range(100)] 
  ....:     for _ in range(100)])

In [8]: %timeit np.tril(a*a, 0)
10000 loops, best of 3: 56.3 µs per loop

In [9]: %timeit np.tril(a*a, -20)
10000 loops, best of 3: 61 µs per loop

In [10]: %timeit np.tril(a*a, 20)
10000 loops, best of 3: 54.7 µs per loop

In [11]: %timeit np.tril(a*a, 60)
10000 loops, best of 3: 54.5 µs per loop

Edit: here is a performance/size graph for each algorithm: [performance/size graph]

Edit: here is the last benchmarking code I wrote:

import numpy as np
import time
import random
from matplotlib import pyplot as plt
from grav_c import grav_c, grav2_c
from numba import jit, njit
import datetime

G = 6.67408*10**-8     # m³/s²T


def grav2(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[:b] * G * d2**(-1.5) * d).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * G * d2**(-1.5) * d
    return a
grav2_jit = jit(grav2)


def grav(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[:b] * G / np.sqrt(d2) / d2 * d).sum(axis=1)
##        a[:, b] = -(M[:b] * G * d2**(-1.5) * d).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * G / np.sqrt(d2) / d2 * d
##        a[:, :b] += M[b] * G * d2**(-1.5) * d
    return a
grav_jit = jit(grav)


def grav2_optim1(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        VVV = G * d2**(-1.5)
        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[:b] * VVV * d).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * VVV * d
    return a
grav2_optim1_jit = jit(grav2_optim1)


def grav_optim1(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        VVV = G / np.sqrt(d2) / d2
        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[:b] * VVV * d).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * VVV * d
    return a
grav_optim1_jit = jit(grav_optim1)


def grav2_optim2(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        XXX = G * d * d2**(-1.5)
        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[None, :b] * XXX).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * XXX
    return a
grav2_optim2_jit = jit(grav2_optim2)


def grav_optim2(p, M):
    l = len(p[0])
    a = np.empty(shape=(2, l))
    a[:, 0] = 0
    for b in range(1, l):
        # computing the distance between body #b and all previous
        d = p[:, b:b+1] - p[:, :b]
        d2 = (d*d).sum(axis=0)
        d2[d2==0] = 1
        XXX = G * d / np.sqrt(d2) / d2

        # computing Newton formula : acceleration undergone by b from all previous
        a[:, b] = -(M[None, :b] * XXX).sum(axis=1)

        # computing Newton formula : adding for each previous, acceleration undergone by from b
        a[:, :b] += M[b] * XXX
    return a
grav_optim2_jit = jit(grav_optim2)


def grav2_vect(p, M):
    d = p[:, :, None] - p[:, None, :]
    d2 = (d*d).sum(axis=0)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=1)
grav2_vect_jit = jit(grav2_vect)

def grav_vect(p, M):
    d = p[:, :, None] - p[:, None, :]
    d2 = (d*d).sum(axis=0)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d/(np.sqrt(d2)*d2)[None, :, :]).sum(axis=1)
grav_vect_jit = jit(grav_vect)

# the grav*_vect_bis functions are equivalent to the grav*_vect functions because d is symetric
def grav2_vect_bis(p, M):
    d = p[:, :, None] - p[:, None, :]
    d2 = (d*d).sum(axis=0)
    d2[d2==0] = 1
    return (-M[None, None, :]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=2)
grav2_vect_bis_jit = jit(grav2_vect_bis)

def grav_vect_bis(p, M):
    d = p[:, :, None] - p[:, None, :]
    d2 = (d*d).sum(axis=0)
    d2[d2==0] = 1
    return (-M[None, None, :]*G*d/(np.sqrt(d2)*d2)[None, :, :]).sum(axis=2)
grav_vect_bis_jit = jit(grav_vect_bis)

def grav2_tril(p, M):
    d = np.tril(p[:, :, None] - p[:, None, :], -1)
    d2 = np.tril((d*d).sum(axis=0), -1)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=1) - \
           (M[None, None, :]*G*d*(d2**(-1.5))[None, :, :]).sum(axis=2)
grav2_tril_jit = jit(grav2_tril)

def grav_tril(p, M):
    d = np.tril(p[:, :, None] - p[:, None, :], -1)
    d2 = np.tril((d*d).sum(axis=0), -1)
    d2[d2==0] = 1
    return (M[None, :, None]*G*d/(np.sqrt(d2)*d2)[None, :, :]).sum(axis=1) - \
           (M[None, None, :]*G*d/(np.sqrt(d2)*d2)[None, :, :]).sum(axis=2)
grav_tril_jit = jit(grav_tril)


testslist = [
             ('grav_vect', 'c'), ('grav2_vect', 'c--'), ('grav_vect_jit', 'c:'), ('grav2_vect_jit', 'c-.'),
             ('grav_vect_bis', 'm'), ('grav2_vect_bis', 'm--'), ('grav_vect_bis_jit', 'm:'), ('grav2_vect_bis_jit', 'm-.'), 
             ('grav_tril', 'y'), ('grav2_tril', 'y--'), ('grav_tril_jit', 'y:'), ('grav2_tril_jit', 'y-.'),
             ('grav', 'r'), ('grav2', 'r--'), ('grav_jit', 'r:'), ('grav2_jit', 'r-.'), 
             ('grav_optim1', 'g'), ('grav2_optim1', 'g--'), ('grav_optim1_jit', 'g:'), ('grav2_optim1_jit', 'g-.'), 
             ('grav_optim2', 'b'), ('grav2_optim2', 'b--'), ('grav_optim2_jit', 'b:'), ('grav2_optim2_jit', 'b-.'), 
             ('grav_c', 'k'),('grav2_c', 'k--')]

class ScaleType: pass
class LinScale(ScaleType): pass
class LogScale(ScaleType): pass
attempts = 8
scaletype = LogScale
scalelen = 200
scalestart = 2
scalestop = 400


# input data (Multiple datasets to repeat the tests on different data)
randlist = lambda x : [float(random.randint(10000, 99999)) for _ in range(x)]

try:
#    data_file = "Here you can give an npz file name to load some presaved data.npz"
    with np.load(data_file) as data:
        testslist = data['testslist']
        N = data['N']
        timings = data['timings']
        perform = data['perform']
        miny = data['miny']
except NameError:
    L = scalestop-scalestart
    if scalelen > L:
        N = np.arange(scalestart, scalestop+1, 1)
    elif scaletype == LinScale:
        Q = L//(scalelen-1)
        R = L%(scalelen-1)
        N = np.array([i for r in (range(scalestart, scalestart+Q*(scalelen-1-R), Q),
                                  range(scalestart+Q*(scalelen-1-R), scalestop+1, Q+1)) for i in r])
    elif scaletype == LogScale:
        X = scalestart
        GR = scalestop/scalestart  # growth ratio (renamed from G, which clobbered the gravitational constant)
        I = scalelen-1
        while True:
            NX = I*np.log(I/np.log(GR)/scalestart)/np.log(GR)
            if NX-X < 0.0001: break
            X = NX
            L0 = int(scalestart*np.power(GR, X/I))
            GR = scalestop/(scalestart+L0)
            I = scalelen-1-L0

        a1 = np.array(range(I))
        N = np.concatenate((range(scalestart, scalestart+L0, 1),
                            scalestart+L0-1+np.cumsum((0.+(scalestart+L0)*(np.exp(np.log(GR)*(a1+1)/I) - np.exp(np.log(GR)*a1/I))).astype(int)),
                            [scalestop]))
    print(N)


    l = len(N)
    timings = np.full(l, 9999999., dtype=[(test[0], np.float64) for test in testslist])
    perform = np.full(l, 9999999., dtype=[(test[0], np.float64) for test in testslist])
    miny = 9999999.

    accum = 0 # This is to prevent the system from optimising the calls away
    for j in range(attempts):
        for i in range(l):
            L = N[i]
            system_p = [np.array([randlist(L), randlist(L)]) for _ in range(100)]
            system_M = [np.array( randlist(L)) for _ in range(100)]
            for test in testslist:
                timeref = -time.time()
                system_a = eval(test[0]+'(system_p[0], system_M[0])')
                accum += system_a[0, 0]

                count = 1
                while time.time()+timeref<0.001:
                    for count in range(count+1, 10*count+1):
                        system_a = eval(test[0]+'(system_p[count%100], system_M[count%100])')

                timeref += time.time()
##                print(count)
                timings[test[0]][i] = min(timings[test[0]][i], timeref/count)
                val = timings[test[0]][i]/(N[i]*(N[i]-1)/2)
                perform[test[0]][i] = val
                miny = min(val, miny)
            if i%10==9: print(j, end='', flush=True)
        print(flush=True)

    filename = "example grav, stackoverflow "+str(datetime.datetime.now())+".npz"
    print("saving data to", filename)
    np.savez(filename, testslist=testslist, N=N, timings=timings, perform=perform, miny=miny)


ymin = 10**(np.floor(np.log10(miny)))
if (5*ymin<=miny): ymin *= 5
elif (2*ymin<=miny): ymin *= 2

print('ymin = {}, miny = {}\n'.format(ymin, miny))

figa, ax = plt.subplots(figsize=(24, 12))
for test in testslist:
    ax.plot(N, timings[test[0]], test[1], label=test[0])
ax.set_title('numpy compared timings')
plt.xlabel('N (system size)')
plt.ylabel('timings (s)')
plt.grid(True)
plt.legend(loc='upper left', bbox_to_anchor=(0., 1), shadow=True, ncol=7)
plt.subplots_adjust(left=0.05, bottom=0.06, right=0.98, top=0.98)


figb, bx = plt.subplots(figsize=(24, 12))
for test in testslist:
    bx.plot(N, timings[test[0]], test[1], label=test[0])
bx.set_title('numpy compared timings')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('N (system size)')
plt.ylabel('timings (s)')
plt.grid(True)
plt.legend(loc='upper left', bbox_to_anchor=(0., 1), shadow=True, ncol=7)
plt.subplots_adjust(left=0.05, bottom=0.06, right=0.98, top=0.98)


figc, cx = plt.subplots(figsize=(24, 12))
for test in testslist:
    cx.plot(N, perform[test[0]], test[1], label=test[0])
plt.ylim(0, 20*ymin)

cx.set_title('numpy compared performance')
plt.xlabel('N (system size)')
plt.ylabel('performance (s)/N²')
plt.grid(True)
plt.legend(loc='upper right', bbox_to_anchor=(1., 1), shadow=True, ncol=7)
plt.subplots_adjust(left=0.05, bottom=0.06, right=0.98, top=0.98)


figd, dx = plt.subplots(figsize=(24, 12))
for test in testslist:
    dx.plot(N, perform[test[0]], test[1], label=test[0])
dx.set_title('numpy compared performance')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('N (system size)')
plt.ylabel('performance (s)/N²')
plt.grid(True)
plt.legend(loc='upper right', bbox_to_anchor=(1., 1), shadow=True, ncol=7)
plt.subplots_adjust(left=0.05, bottom=0.06, right=0.98, top=0.98)

plt.show()

Together with its C module:

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <math.h>               /* for sqrt() and pow() */
#include <numpy/arrayobject.h>


#define G 6.67408E-8L 

void * failure(PyObject *type, const char *message) {
    PyErr_SetString(type, message);
    return NULL;
}

void * success(PyObject *var){
    Py_INCREF(var);
    return var;
}


static PyObject *
Py_grav_c(PyObject *self, PyObject *args)
{
    PyArrayObject *p, *M;
    PyObject *a;
    int i, j, k;
    double *pq0, *pq1, *Mq0, *Mq1, *aq0, *aq1, *p0, *p1, *a0, *a1;


    if (!PyArg_ParseTuple(args, "O!O!", &PyArray_Type, &p, &PyArray_Type, &M))
        return failure(PyExc_RuntimeError, "Failed to parse parameters.");

    if (PyArray_DESCR(p)->type_num != NPY_DOUBLE)
        return failure(PyExc_TypeError, "Type np.float64 expected for p array.");

    if (PyArray_DESCR(M)->type_num != NPY_DOUBLE)
        return failure(PyExc_TypeError, "Type np.float64 expected for M array.");

    if (PyArray_NDIM(p)!=2)
        return failure(PyExc_TypeError, "p must be a 2 dimensionnal array.");

    if (PyArray_NDIM(M)!=1)
        return failure(PyExc_TypeError, "M must be a 1 dimensionnal array.");

    int K = PyArray_DIM(p, 0);     // Number of dimensions you want
    int L = PyArray_DIM(p, 1);     // Number of bodies in the system
    int S0 = PyArray_STRIDE(p, 0); // Normally, the arrays should be contiguous
    int S1 = PyArray_STRIDE(p, 1); // But since they provide this Stride info
    int SM = PyArray_STRIDE(M, 0); // I supposed they might not be (alignment)

    if (PyArray_DIM(M, 0) != L)
        return failure(PyExc_TypeError, 
                       "P and M must have the same number of bodies.");

    a = PyArray_NewLikeArray(p, NPY_ANYORDER, NULL, 0);
    if (a == NULL)
        return failure(PyExc_RuntimeError, "Failed to create output array.");
    PyArray_FILLWBYTE(a, 0);

    // For all bodies except first which has no previous body
    for (i = 1,
         pq0 = (double *)(PyArray_DATA(p)+S1),
         Mq0 = (double *)(PyArray_DATA(M)+SM),
         aq0 = (double *)(PyArray_DATA(a)+S1);
         i < L;
         i++,
         *(void **)&pq0 += S1,
         *(void **)&Mq0 += SM,
         *(void **)&aq0 += S1
         ) {
        // For all previous bodies
        for (j = 0,
            pq1 = (double *)PyArray_DATA(p),
            Mq1 = (double *)PyArray_DATA(M),
            aq1 = (double *)PyArray_DATA(a);
            j < i;
            j++,
            *(void **)&pq1 += S1,
            *(void **)&Mq1 += SM,
            *(void **)&aq1 += S1
             ) {
            // For all dimensions calculate deltas
            long double d[K], d2 = 0, VVV, M0xVVV, M1xVVV;
            for (k = 0,
                 p0 = pq0,
                 p1 = pq1;
                 k<K;
                 k++,
                 *(void **)&p0 += S0,
                 *(void **)&p1 += S0) {
                d[k] = *p1 - *p0;
            }
            // calculate Hypotenuse squared
            for (k = 0, d2 = 0; k<K; k++) {
                d2 += d[k]*d[k];
            }
            // calculate interm. results once for each bodies pair (optimization)
            VVV = G;
#define LIM 1
//            if (d2<LIM) d2=LIM;                   // Variation on collision case
            if (d2>0) VVV /= d2*sqrt(d2);
            M0xVVV = *Mq0 * VVV;                  // anonymous intermediate result
            M1xVVV = *Mq1 * VVV;                  // anonymous intermediate result
            // For all dimensions calculate component of acceleration
            for (k = 0,
                 a0 = aq0,
                 a1 = aq1;
                 k<K;
                 k++,
                 *(void **)&a0 += S0,
                 *(void **)&a1 += S0) {
                *a0 += M1xVVV*d[k];
                *a1 -= M0xVVV*d[k];
            }
        }
    }

    /*  clean up and return the result */
    return success(a);
}

static PyObject *
Py_grav2_c(PyObject *self, PyObject *args)
{
    PyArrayObject *p, *M;
    PyObject *a;
    int i, j, k;
    double *pq0, *pq1, *Mq0, *Mq1, *aq0, *aq1, *p0, *p1, *a0, *a1;


    if (!PyArg_ParseTuple(args, "O!O!", &PyArray_Type, &p, &PyArray_Type, &M))
        return failure(PyExc_RuntimeError, "Failed to parse parameters.");

    if (PyArray_DESCR(p)->type_num != NPY_DOUBLE)
        return failure(PyExc_TypeError, "Type np.float64 expected for p array.");

    if (PyArray_DESCR(M)->type_num != NPY_DOUBLE)
        return failure(PyExc_TypeError, "Type np.float64 expected for M array.");

    if (PyArray_NDIM(p)!=2)
        return failure(PyExc_TypeError, "p must be a 2 dimensionnal array.");

    if (PyArray_NDIM(M)!=1)
        return failure(PyExc_TypeError, "M must be a 1 dimensionnal array.");

    int K = PyArray_DIM(p, 0);     // Number of dimensions you want
    int L = PyArray_DIM(p, 1);     // Number of bodies in the system
    int S0 = PyArray_STRIDE(p, 0); // Normally, the arrays should be contiguous
    int S1 = PyArray_STRIDE(p, 1); // But since they provide this Stride info
    int SM = PyArray_STRIDE(M, 0); // I supposed they might not be (alignment)

    if (PyArray_DIM(M, 0) != L)
        return failure(PyExc_TypeError, 
                       "P and M must have the same number of bodies.");

    a = PyArray_NewLikeArray(p, NPY_ANYORDER, NULL, 0);
    if (a == NULL)
        return failure(PyExc_RuntimeError, "Failed to create output array.");
    PyArray_FILLWBYTE(a, 0);

    // For all bodies except first which has no previous body
    for (i = 1,
         pq0 = (double *)(PyArray_DATA(p)+S1),
         Mq0 = (double *)(PyArray_DATA(M)+SM),
         aq0 = (double *)(PyArray_DATA(a)+S1);
         i < L;
         i++,
         *(void **)&pq0 += S1,
         *(void **)&Mq0 += SM,
         *(void **)&aq0 += S1
         ) {
        // For all previous bodies
        for (j = 0,
            pq1 = (double *)PyArray_DATA(p),
            Mq1 = (double *)PyArray_DATA(M),
            aq1 = (double *)PyArray_DATA(a);
            j < i;
            j++,
            *(void **)&pq1 += S1,
            *(void **)&Mq1 += SM,
            *(void **)&aq1 += S1
             ) {
            // For all dimensions calculate deltas
            long double d[K], d2 = 0, VVV, M0xVVV, M1xVVV;
            for (k = 0,
                 p0 = pq0,
                 p1 = pq1;
                 k<K;
                 k++,
                 *(void **)&p0 += S0,
                 *(void **)&p1 += S0) {
                d[k] = *p1 - *p0;
            }
            // calculate Hypotenuse squared
            for (k = 0, d2 = 0; k<K; k++) {
                d2 += d[k]*d[k];
            }
            // calculate interm. results once for each bodies pair (optimization)
            VVV = G;
#define LIM 1
//            if (d2<LIM) d2=LIM;                   // Variation on collision case
            if (d2>0) VVV *= pow(d2, -1.5);
            M0xVVV = *Mq0 * VVV;                  // anonymous intermediate result
            M1xVVV = *Mq1 * VVV;                  // anonymous intermediate result
            // For all dimensions calculate component of acceleration
            for (k = 0,
                 a0 = aq0,
                 a1 = aq1;
                 k<K;
                 k++,
                 *(void **)&a0 += S0,
                 *(void **)&a1 += S0) {
                *a0 += M1xVVV*d[k];
                *a1 -= M0xVVV*d[k];
            }
        }
    }

    /*  clean up and return the result */
    return success(a);
}



// exported functions list

static PyMethodDef grav_c_Methods[] = {
    {"grav_c", Py_grav_c, METH_VARARGS, "grav_c(p, M)\n"
"\n"
"grav_c takes the positions and masses of m bodies in Newtonian attraction in a n dimensionnal universe,\n"
"and returns the accelerations each body undergoes.\n"
"input data take the for of a row of fload64 for each dimension of the position (in p) and one row for the masses.\n"
"It returns and array of the same shape as p for the accelerations."},
    {"grav2_c", Py_grav2_c, METH_VARARGS, "grav_c(p, M)\n"
"\n"
"grav_c takes the positions and masses of m bodies in Newtonian attraction in a n dimensionnal universe,\n"
"and returns the accelerations each body undergoes.\n"
"input data take the for of a row of fload64 for each dimension of the position (in p) and one row for the masses.\n"
"It returns and array of the same shape as p for the accelerations."},
    {NULL, NULL, 0, NULL} // pour terminer la liste.
};


static char grav_c_doc[] = "Compute attractions between n bodies.";



static struct PyModuleDef grav_c_module = {
    PyModuleDef_HEAD_INIT,
    "grav_c",   /* name of module */
    grav_c_doc, /* module documentation, may be NULL */
    -1,         /* size of per-interpreter state of the module,
                 or -1 if the module keeps state in global variables. */
    grav_c_Methods
};



PyMODINIT_FUNC
PyInit_grav_c(void)
{
    // import_array() initialises the NumPy C API; without it, any PyArray_* call segfaults.
    import_array();

    return PyModule_Create(&grav_c_module);
} 
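
For anyone wanting to build the module: below is a minimal, untested build-script sketch, assuming the C source above is saved as grav_c.c (the file name is an assumption). Build in place with python setup.py build_ext --inplace.

from setuptools import setup, Extension
import numpy as np

setup(
    name='grav_c',
    ext_modules=[
        Extension(
            'grav_c',
            sources=['grav_c.c'],
            include_dirs=[np.get_include()],   # where numpy/arrayobject.h lives
        ),
    ],
)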
Camion
  • Why do you say it's twice the calculations? Some parts gets doubled, some do not. Anyway, you could probably reduce the number of calculations by using `np.triu()` and related. But I am not sure this would get you a faster or cleaner solution. – norok2 May 12 '19 at 11:45
  • In fact, for the sake of readability, I removed from my initial post a few optimizations I used to avoid recomputing intermediate results, and the comparison with the equivalent pure C module, which uses those optimizations, shows that this last numpy version is a bit more than four times slower than the C version. And by the way, I'm very glad I learnt how numpy can be used, but my initial question is still whether it is possible to use numpy on combinations of lines (`n*(n+1)/2` iterations) while what we are doing here is arrangements with repetition (`n²` iterations). – Camion May 13 '19 at 00:59
  • About `np.triu()`: it seems it will make a triangular matrix, but it will just zero out some entries. I fear it won't go any faster. – Camion May 13 '19 at 01:01
  • I meant more the **related functions**, like `np.triu_indices()`. Still I do not think you'll be able to speed things up this way. For a more general discussion including reference to papers for efficiently treating symmetric problems see: https://stackoverflow.com/questions/2572916/numpy-smart-symmetric-matrix Which basically answer your question: no, there is no faster way within NumPy. However, it looks to me that this is something Numba should be able to `jit`. Perhaps you could try that. I can't remember if `numexpr` supports broadcasting, but if it does, that may also speed things up. – norok2 May 13 '19 at 10:53
  • Hmmm... I had an idea about how to proceed with np.tril; having tested it, it doesn't speed things up at all. On the contrary, it slowed things down by a factor of 10. – Camion May 14 '19 at 02:35
  • Having done more tests, I observed that for more than 600 bodies in the system, my initial loop is faster than this version. – Camion May 14 '19 at 07:09
  • Perhaps you are using a version of NumPy linked to a non-optimized BLAS lib. I'd recommend trying out the version that ships with Anaconda if you are not already using that. – norok2 May 15 '19 at 22:00
  • No, I don't. I will have a look at that. However, there is no reason to think it would be any different. If you think about it, my original algorithm is more efficient: the only difference is that it uses a Python loop, which slows it down, but for a very large array the complexity of the whole calculation grows as O(n²) while the overhead of the Python loop only grows as O(n). I will add a graphic I made. – Camion May 15 '19 at 22:29
  • Perhaps you should post your C code and include its timings in the plot, just as a reference. – norok2 May 16 '19 at 12:45
  • I'm currently struggling to rewrite it with numpy to match this testing program, because the original code I'm working on is not numpy-compatible and is much too large to use in an example. – Camion May 17 '19 at 10:22
  • @norok2, I didn't forget your question. I have written and tested the code, but I have been bogged down with other problems which prevented me from coming back to this, and there are still other tests I would like to make based on the various answers; anyway, I'm posting the current state of my code. – Camion Jun 04 '19 at 10:00

I guess that you cannot get a better NumPythonic expression than grav_vectorized(), which, as already noticed, increases the computational complexity and memory by slightly less than a factor of 2.

However, if you are after speed and you still want to stay in the Python lap, then for this specific application the devil is in the details and in the input size. Specifically, the timing, at least on my machine, seems to be dominated, for each iteration, by the fractional power, and saving its result in a temporary variable speeds things up.

At the end of the day, if you use a Numba JIT version of your original code with the redundancy optimization (grav_optim_jit()), you will be close to optimal almost always. For very small inputs, the Cython version (grav_loop_cython()) is the fastest, but as soon as the size grows a little (N > ~100), the better-optimized Numba code (grav_optim_jit()) takes over the podium. For even larger inputs, the NumPy-optimized but still n (n + 1) / 2 version, vectorized along the space dimension (grav_iter()), becomes the runner-up, despite being the slowest in the small-N regime. Note that the fully vectorized version (grav_vectorized()) performs quite well for small N but falls short as soon as the input size increases. Note also that grav_iter_jit() is essentially unaffected (perhaps slightly slowed down) by the Numba JIT, as the core optimizations are on the NumPy side.

It is possible that the Cython version could get faster by compiling with -O3 -march=native options, but I have not tried that.
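
If you want to try, here is a sketch of how those flags could be passed through the IPython Cython magic (untested on my side, so treat it as an assumption):

%%cython --compile-args=-O3 --compile-args=-march=native
# ... same Cython code as in grav_loop_cython below ...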

[timing plot]


Here are the details of the above.

This is the code I tested:

import numpy as np
import numba as nb

G = 6.67408e-2

MAX_SIZE = 1000
MAX_MASS = 1000

a slightly more polished version of your original code:

def grav_orig(x_arr, m_arr, G=G):
    n_dim, n_points = x_arr.shape
    a_arr = np.zeros_like(x_arr, dtype=np.float64)
    for i in range(1, n_points):
        dx_arr = x_arr[0, i] - x_arr[0, :i]
        dy_arr = x_arr[1, i] - x_arr[1, :i]
        d2_arr = dx_arr ** 2 + dy_arr ** 2
        a_arr[0, i] = -np.sum(m_arr[:i] * dx_arr * G * d2_arr**(-1.5))
        a_arr[1, i] = -np.sum(m_arr[:i] * dy_arr * G * d2_arr**(-1.5))
        a_arr[0, :i] += m_arr[i] * dx_arr * G * d2_arr**(-1.5)
        a_arr[1, :i] += m_arr[i] * dy_arr * G * d2_arr**(-1.5)
    return a_arr

the optimized version:

def grav_optim(x_arr, m_arr, G=G):
    n_dim, n_points = x_arr.shape
    a_arr = np.zeros_like(x_arr, dtype=np.float64)
    for i in range(1, n_points):
        dx_arr = x_arr[0, i] - x_arr[0, :i]
        dy_arr = x_arr[1, i] - x_arr[1, :i]
        d2_arr = dx_arr ** 2 + dy_arr ** 2
        temp = G * d2_arr**(-1.5)
        temp_x = dx_arr * temp
        temp_y = dy_arr * temp
        a_arr[0, i] = -np.sum(m_arr[:i] * temp_x)
        a_arr[1, i] = -np.sum(m_arr[:i] * temp_y)
        a_arr[0, :i] += m_arr[i] * temp_x
        a_arr[1, :i] += m_arr[i] * temp_y
    return a_arr

the corresponding Numba JITted version:

grav_optim_jit = nb.jit(grav_optim, nopython=True, nogil=True)

a similar approach to the original but vectorizing along the spatial dimensions:

def grav_iter(x_arr, m_arr, G=G):
    n_dim, n_points = x_arr.shape
    a_arr = np.zeros_like(x_arr, dtype=np.float64)
    for i in range(1, n_points):
        d_arr = x_arr[:, i:i + 1] - x_arr[:, :i]
        d2_arr = np.sum(d_arr ** 2, axis=0)
        temp = G * d_arr * d2_arr[None, :]**(-1.5)
        a_arr[:, i] = -np.sum(m_arr[None, :i] * temp, axis=-1)
        a_arr[:, :i] += m_arr[None, i] * temp
    return a_arr

the corresponding Numba JITted version:

grav_iter_jit = nb.jit(grav_iter)

the fully vectorized version:

def grav_vectorized(x_arr, m_arr, G=G):
    d_arr = x_arr[:, :, None] - x_arr[:, None, :]
    d2_arr = np.sum(d_arr ** 2, axis=0)
    d2_arr[d2_arr == 0] = 1
    return np.sum((m_arr[None, :, None] * G * d_arr * d2_arr[None, ...]**(-1.5)), axis=1)

the Cythonized version:

%%cython -a
#cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np

cimport cython
cimport numpy as np

DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef DTYPE_t G = 6.67408e-2

def grav_optim_cython(x_arr, m_arr, G=G):
    n_dim, n_points = x_arr.shape
    a_arr = np.zeros_like(x_arr, dtype=np.float64)
    for i in range(1, n_points):
        dx_arr = x_arr[0, i] - x_arr[0, :i]
        dy_arr = x_arr[1, i] - x_arr[1, :i]
        d2_arr = dx_arr ** 2 + dy_arr ** 2
        temp = G * d2_arr**(-1.5)
        temp_x = dx_arr * temp
        temp_y = dy_arr * temp
        a_arr[0, i] = -np.sum(m_arr[:i] * temp_x)
        a_arr[1, i] = -np.sum(m_arr[:i] * temp_y)
        a_arr[0, :i] += m_arr[i] * temp_x
        a_arr[1, :i] += m_arr[i] * temp_y
    return a_arr

the Cythonized loop-explicit version:

%load_ext Cython

%%cython -a
#cython: boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True

import numpy as np

cimport cython
cimport numpy as np

DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef DTYPE_t G = 6.67408e-2

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def grav_loop_cython(
        np.ndarray[DTYPE_t, ndim=2] x_arr,
        np.ndarray[DTYPE_t, ndim=1] m_arr,
        DTYPE_t G=G):
    cdef int ndim = x_arr.shape[0]
    cdef int n_points = x_arr.shape[1]
    cdef np.ndarray[DTYPE_t, ndim=2] a_arr = np.zeros((ndim, n_points), dtype=DTYPE)
    cdef np.ndarray[DTYPE_t, ndim=2] dx_arr = np.zeros((ndim, n_points - 1), dtype=DTYPE)
    cdef np.ndarray[DTYPE_t, ndim=1] d2_arr = np.zeros((n_points - 1), dtype=DTYPE)
    cdef DTYPE_t temp
    for j in range(1, n_points):
        # compute the pair-wise differences
        for jj in range(j):
            for i in range(ndim):
                dx_arr[i, jj] = x_arr[i, j] - x_arr[i, jj]
        # compute the pair-wise squared Euclidean distances
        for jj in range(j):
            d2_arr[jj] = 0
            for i in range(ndim):
                d2_arr[jj] += dx_arr[i, jj] ** 2.0
        for i in range(ndim):
            for jj in range(j):
                temp = G * dx_arr[i, jj] * d2_arr[jj] ** -1.5
                a_arr[i, j] -= (m_arr[jj] * temp)
                a_arr[i, jj] += (m_arr[j] * temp)
    return a_arr

Timings

I timed all this with the following code:

funcs = (
    grav_orig,
    grav_optim,
    grav_optim_jit,
    grav_optim_cython,
    grav_iter,
    grav_iter_jit,
    grav_loop_cython,
    grav_vectorized,
)

Ns = np.geomspace(1e1, 6e3, 16).astype(int)
timings = np.zeros((len(funcs), len(Ns)))

np.random.seed(0)
for i, N in enumerate(Ns):
    print('N: ', N)
    x_arr = np.random.random((2, N)) * MAX_SIZE
    m_arr = np.random.random((N,)) * MAX_MASS 
    for j, func in enumerate(funcs):
        test_result = np.all(np.isclose(grav_orig(x_arr, m_arr), func(x_arr, m_arr)))
        func_name = func.__name__ + ('_jit' if '__numba__' in dir(func) else '')
        print(f'{func_name:20s} {test_result} ', end='')
        t = %timeit -o func(x_arr, m_arr)
        timings[j, i] = t.best
    print()

In numbers:

# N:  10
# grav_orig            True 501 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim           True 358 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_jit       True 17.4 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# grav_optim_cython    True 383 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter            True 371 µs ± 40.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter_jit        True 481 µs ± 42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_loop_cython     True 12.6 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# grav_vectorized      True 41.3 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# N:  15
# grav_orig            True 769 µs ± 81.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim           True 540 µs ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_jit       True 29.2 µs ± 431 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_optim_cython    True 547 µs ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter            True 602 µs ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter_jit        True 750 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_loop_cython     True 22.9 µs ± 738 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_vectorized      True 58.1 µs ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# N:  23
# grav_orig            True 1.11 ms ± 55.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim           True 788 µs ± 9.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_jit       True 53 µs ± 290 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_optim_cython    True 825 µs ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter            True 875 µs ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter_jit        True 1.05 ms ± 70.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_loop_cython     True 49.2 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_vectorized      True 89.6 µs ± 4.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# N:  35
# grav_orig            True 1.87 ms ± 94 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim           True 1.35 ms ± 95.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_jit       True 111 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_optim_cython    True 1.36 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter            True 1.31 ms ± 83.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_iter_jit        True 1.54 ms ± 49.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_loop_cython     True 109 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# grav_vectorized      True 159 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# N:  55
# grav_orig            True 3.13 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim           True 2.05 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_jit       True 237 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_cython    True 2.07 ms ± 54.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter            True 2.37 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter_jit        True 2.87 ms ± 86.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_loop_cython     True 263 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_vectorized      True 326 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# N:  84
# grav_orig            True 4.97 ms ± 72.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim           True 3.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_jit       True 484 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_cython    True 3.26 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter            True 3.87 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter_jit        True 4.81 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_loop_cython     True 645 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_vectorized      True 805 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# N:  129
# grav_orig            True 9.22 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim           True 6.14 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_jit       True 1.07 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_optim_cython    True 5.93 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter            True 6.85 ms ± 651 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter_jit        True 7.68 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_loop_cython     True 1.55 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# grav_vectorized      True 1.8 ms ± 39.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# N:  197
# grav_orig            True 17.4 ms ± 374 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim           True 9.57 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_jit       True 2.32 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_cython    True 9.95 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter            True 9.87 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter_jit        True 12.3 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_loop_cython     True 3.68 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_vectorized      True 4.03 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# N:  303
# grav_orig            True 31.9 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim           True 15.9 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_jit       True 5.36 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_cython    True 16.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter            True 17.1 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_iter_jit        True 18 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_loop_cython     True 8.38 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_vectorized      True 10.6 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# N:  464
# grav_orig            True 70.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim           True 31.7 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_jit       True 12.2 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_optim_cython    True 29.3 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter            True 28.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter_jit        True 32.5 ms ± 737 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_loop_cython     True 19.7 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# grav_vectorized      True 27.8 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# N:  711
# grav_orig            True 126 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim           True 52.7 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_jit       True 27.1 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_cython    True 54.1 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter            True 54.8 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter_jit        True 60.6 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_loop_cython     True 46.8 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_vectorized      True 67.2 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# N:  1089
# grav_orig            True 306 ms ± 31.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim           True 108 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_jit       True 61.5 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_cython    True 103 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter            True 110 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_iter_jit        True 114 ms ± 5.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_loop_cython     True 107 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_vectorized      True 152 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# N:  1669
# grav_orig            True 567 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim           True 201 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_jit       True 141 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# grav_optim_cython    True 207 ms ± 5.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter            True 210 ms ± 3.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter_jit        True 223 ms ± 8.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_loop_cython     True 252 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_vectorized      True 365 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# N:  2557
# grav_orig            True 1.28 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim           True 418 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_jit       True 339 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_cython    True 432 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter            True 452 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter_jit        True 470 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_loop_cython     True 605 ms ± 7.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_vectorized      True 817 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# N:  3916
# grav_orig            True 2.83 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim           True 900 ms ± 25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_jit       True 778 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_cython    True 894 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter            True 951 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter_jit        True 991 ms ± 28.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_loop_cython     True 1.41 s ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_vectorized      True 1.88 s ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# N:  6000
# grav_orig            True 6.77 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim           True 1.95 s ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_jit       True 1.84 s ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_optim_cython    True 2.01 s ± 47.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter            True 2.28 s ± 79.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_iter_jit        True 2.27 s ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_loop_cython     True 3.26 s ± 43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# grav_vectorized      True 4.32 s ± 9.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here is the code for generating the plot:

import matplotlib as mpl
import matplotlib.pyplot as plt

subplot_shape = (1, 2)
fig, axs = plt.subplots(
    *subplot_shape, squeeze=False,
    figsize=(8 * subplot_shape[1], 6 * subplot_shape[0]))


ax = axs[0, 0]
ax.set_title('Small-N Regime')
ax.set_xlabel('N / #')
ax.set_ylabel('timing / ms')
small_Ns = tuple(N for N in Ns if N < 1000)
for j, func in enumerate(funcs):
    func_name = func.__name__ + ('_jit' if '__numba__' in dir(func) else '')
    ax.plot(small_Ns, timings[j, :len(small_Ns)] * 1e3, label=func_name)
ax.legend()


ax = axs[0, 1]
ax.set_title('Full N Range')
ax.set_xlabel('N / #')
ax.set_ylabel('timing / ms')
for j, func in enumerate(funcs):
    func_name = func.__name__ + ('_jit' if '__numba__' in dir(func) else '')
    ax.plot(Ns, timings[j] * 1e3, label=func_name)
ax.legend()
plt.show()

EDIT (Julia)

Just for fun, I implemented the same code in Julia (although I am no expert there), but the timings were quite disappointing, for all the bragging they do about speed.

using Random

G = 6.67408e-2

MAX_SIZE = 1000
MAX_MASS = 1000


function grav(
        x_arr::Array{Float64,2},
        m_arr::Array{Float64,2},
        g::Float64=G)::Array{Float64,2}
    n_dim, n_points = size(x_arr)
    a_arr = zeros(size(x_arr))
    for i in 2:n_points
        d_arr = x_arr[:, i:i] .- x_arr[:, 1:i - 1]
        d2_arr = sum(d_arr .^ 2, dims=1)
        temp = g .* d_arr .* (d2_arr .^ -1.5)
        a_arr[:, i] = -sum(m_arr[:, 1:i - 1] .* temp, dims=2)
        a_arr[:, 1:i - 1] += m_arr[:, i] .* temp
    end
    return a_arr
end


N = 10
x_arr = rand(Float64, 2, N) * MAX_SIZE
m_arr = rand(Float64, 1, N) * MAX_MASS
@time grav(x_arr, m_arr)
# 0.000111 seconds (329 allocations: 23.375 KiB)

N = 6000
x_arr = rand(Float64, 2, N) * MAX_SIZE
m_arr = rand(Float64, 1, N) * MAX_MASS
@time grav(x_arr, m_arr)
# 4.112578 seconds (269.17 k allocations: 2.426 GiB, 1.93% gc time)
# BenchmarkTools.Trial: 
#   memory estimate:  2.43 GiB
#   allocs estimate:  269167
#   --------------
#   minimum time:     4.096 s (3.31% GC)
#   median time:      4.169 s (2.50% GC)
#   mean time:        4.169 s (2.50% GC)
#   maximum time:     4.243 s (1.72% GC)
#   --------------
#   samples:          2
#   evals/sample:     1
norok2

A Numba approach

It does not take much to get quite a high speed-up of your code:

  • Fuse the separate loops (every vectorized command is a loop in itself)
  • Do some math: d2**(-1.5) is a very costly operation which can be replaced with d2 = 1./(np.sqrt(d2)*d2) (see the micro-benchmark sketch right after this list)
  • Install Intel SVML to get faster implementations of functions like sin, sqrt, pow, ...
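
As a quick sanity check of the second point, here is a small micro-benchmark sketch (illustrative only; absolute timings will vary with machine and NumPy build):

import numpy as np

d2 = np.random.random(10_000) + 0.1

# %timeit d2**(-1.5)               # fractional power, element-wise pow
# %timeit 1./(np.sqrt(d2)*d2)      # same value from a sqrt, a multiply and a divide

assert np.allclose(d2**(-1.5), 1./(np.sqrt(d2)*d2))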

Code

import numba as nb
import numpy as np

def grav(px, py, M):
    G = 6.67408*10**-2     # m³/s²T
    nPoints=px.shape[0]
    ax=np.zeros(nPoints,dtype=np.float64)
    ay=np.zeros(nPoints,dtype=np.float64)

    for b in range(1, px.shape[0]):
        # computing the distance between body #b and all previous
        dx = px[b] - px[:b]
        dy = py[b] - py[:b]
        d2 = dx*dx+dy*dy

        # computing acceleration undergone by b from all previous
        ax[b] = -np.sum(M[:b]*G * dx * d2**(-1.5))
        ay[b] = -np.sum(M[:b]*G * dy * d2**(-1.5))

        # adding for each previous, acceleration undergone by from b
        ax[:b] += M[b]*G * dx * d2**(-1.5)
        ay[:b] += M[b]*G * dy * d2**(-1.5)
    return ax,ay

@nb.njit(fastmath=True,error_model="numpy")
def grav_2(px, py, M):
    G = 6.67408*10**-2     # m³/s²T
    nPoints=px.shape[0]
    ax=np.zeros(nPoints,dtype=np.float64)
    ay=np.zeros(nPoints,dtype=np.float64)
    for b in range(1, nPoints):
        sum_x=0.
        sum_y=0.
        for i in range(0,b):
            # computing the distance between body #b and all previous
            dx = px[b] - px[i]
            dy = py[b] - py[i]
            d2 = (dx*dx+dy*dy)

            #Much less costly than d2 = d2**(-1.5)
            d2 = 1./(np.sqrt(d2)*d2)

            # computing acceleration undergone by b from all previous
            sum_x += M[i]*G * dx * d2
            sum_y += M[i]*G * dy * d2

            # adding for each previous, acceleration undergone by from b
            ax[i] += M[b]*G * dx * d2
            ay[i] += M[b]*G * dy * d2

        ax[b]=(-1)*sum_x
        ay[b]=(-1)*sum_y
    return ax,ay
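
The timings below use px, py and M arrays that are not shown in the answer; a plausible way to generate them (my assumption, with N set per run) would be:

import numpy as np

N = 10
px = np.random.random(N) * 1000.
py = np.random.random(N) * 1000.
M  = np.random.random(N) * 1000.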

Timings

N=10
%timeit res=grav(px, py, M)
212 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=grav_2(px, py, M)
1.29 µs ± 7.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

N=100
%timeit res=grav(px, py, M)
2.86 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=grav_2(px, py, M)
18.9 µs ± 37.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

N=1000
%timeit res=grav(px, py, M)
86.5 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res=grav_2(px, py, M)
1.79 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

N=10_000
%timeit res=grav(px, py, M)
6.28 s ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res=grav_2(px, py, M)
180 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

N=50_000
#Use a more advanced algorithm:
#small particles don't interact significantly with other particles
#which are far away (k-d-tree based approaches? see the sketch below)
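
To illustrate that last remark, here is a rough, untested sketch of a cutoff-based approach using scipy's k-d tree: only pairs closer than some radius r_cut are evaluated. Note that this changes the physics (far-field forces are simply dropped, unlike in e.g. Barnes-Hut), and all the names here are assumptions of mine, not part of the answer above.

import numpy as np
from scipy.spatial import cKDTree

def grav_cutoff(p, M, r_cut, G=6.67408*10**-2):
    n_dim, n = p.shape
    tree = cKDTree(p.T)                # cKDTree expects shape (n_points, n_dim)
    pairs = tree.query_pairs(r_cut, output_type='ndarray')  # unique pairs, i < j
    i, j = pairs[:, 0], pairs[:, 1]
    d = p[:, j] - p[:, i]              # vector from body i to body j
    d2 = (d * d).sum(axis=0)
    f = G * d / (np.sqrt(d2) * d2)     # direction / r^3, computed once per pair
    a = np.zeros((n, n_dim))           # (bodies, dims) layout for np.add.at
    np.add.at(a, i, (M[j] * f).T)      # pull of j on i
    np.add.at(a, j, -(M[i] * f).T)     # equal and opposite pull of i on j
    return a.T                         # back to the (dims, bodies) layout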
max9111
  • Oh, that's nice, and I have been thinking about using Numba, but since I don't own super graphics hardware, I'm not sure I will be able to make use of it :'( – Camion May 17 '19 at 01:36
  • @Camion What do you mean by graphics hardware? This is single-threaded CPU code, you don't need any GPU for that. – max9111 May 17 '19 at 05:54
  • Isn't Numba about parallelizing on the GPU? (I haven't examined this yet; I'm currently struggling to understand the numpy documentation about accessing arrays through C extensions) – Camion May 17 '19 at 05:57
  • You can write GPU code, but most of Numba is for writing CPU code, which is also a lot easier to start with. – max9111 May 17 '19 at 07:13
  • I will have a look as soon as I have managed to get the numpy interface with C working, and have tried norok2's answer. – Camion May 17 '19 at 08:09
  • Upvoted for realizing that the `sqrt()` expression would be faster than the power one. Do you also know why? However, I do not quite like that adding an extra dim would require extra code, while the logic is all there. – norok2 May 17 '19 at 13:35
  • @norok2, my guess is that it is because the power version takes 2 arguments, which makes vectorizing more complex. I have done some benchmarking and it appears that, for single data, the power version is faster. I hesitated to post it, because it's quite a long piece of code. – Camion Jun 04 '19 at 08:38
  • @Camion for some pure sqrt vs. pow benchmarks look at https://software.intel.com/en-us/mkl-vmperfdata-sqrt vs https://software.intel.com/en-us/mkl-vmperfdata-pow. But when mixing up different basic functions (division and sqrt vs. pow only) it gets more complicated (you can't simply add the timings up). Don't confuse SIMD vectorization with NumPy vectorized functions; these are two different things. – max9111 Jun 04 '19 at 09:19