4

I need to compute the signed area of many triangles in 2D, given by a numpy array of shape (2, 3, n) (x/y coordinate, node in triangle, number of triangles). I'm looking for a a way to do it fast, and the best I could come up with so far is

import numpy
import perfplot


def six(p):
    return (
        +p[0][2] * p[1][0]
        + p[0][0] * p[1][1]
        + p[0][1] * p[1][2]
        - p[0][2] * p[1][1]
        - p[0][0] * p[1][2]
        - p[0][1] * p[1][0]
    ) / 2


def mix(p):
    return (
        +p[0][2] * (p[1][0] - p[1][1])
        + p[0][0] * (p[1][1] - p[1][2])
        + p[0][1] * (p[1][2] - p[1][0])
    ) / 2


def mix2(p):
    p1 = p[1] - p[1][[1, 2, 0]]
    return (+p[0][2] * p1[0] + p[0][0] * p1[1] + p[0][1] * p1[2]) / 2


def cross(p):
    e1 = p[:, 1] - p[:, 0]
    e2 = p[:, 2] - p[:, 0]
    return (e1[0] * e2[1] - e1[1] * e2[0]) / 2


def einsum(p):
    return (
        +numpy.einsum("ij,ij->j", p[0][[2, 0, 1]], p[1][[0, 1, 2]])
        - numpy.einsum("ij,ij->j", p[0][[2, 0, 1]], p[1][[1, 2, 0]])
    ) / 2


def einsum2(p):
    return numpy.einsum("ij,ij->j", p[0][[2, 0, 1]], p[1] - p[1][[1, 2, 0]]) / 2


def einsum3(p):
    return (
        numpy.einsum(
            "ij,ij->j", numpy.roll(p[0], 1, axis=0), p[1] - numpy.roll(p[1], 2, axis=0)
        )
        / 2
    )


perfplot.save(
    "out.png",
    setup=lambda n: numpy.random.rand(2, 3, n),
    kernels=[six, mix, mix2, cross, einsum, einsum2, einsum3],
    n_range=[2 ** k for k in range(19)],
)

Any hints on how make it even more efficient?

enter image description here

Nico Schlömer
  • 53,797
  • 27
  • 201
  • 249
  • 1
    Interesting question, I'll give it a shot. I wonder why results I'm getting locally are so different from yours though, [test1](https://dl.dropboxusercontent.com/s/otf27gr524tq1on/python_2018-05-18_14-50-20.png), [test2](https://dl.dropboxusercontent.com/s/erqsdjg2cp0dqru/python_2018-05-18_14-51-26.png) :/ – BPL May 18 '18 at 12:51
  • Interesting. Well, could be a different numpy version, different BLAS version and things like that. If you find something that's say, twice as fast, I'm sure it will be an improvement on any machine though. – Nico Schlömer May 18 '18 at 12:53
  • I don't think you can make it simpler than a couple of additions and multiplications, at this point if performance is really a concern you'd need to look into other avenues (Cython, Numba, CUDA, etc.). I'm not sure about the reason for the steep increase of many functions in the 10^4-10^5 step when some are very similar, though... could it be some sort of artifact (`mix` and `six` are the first ones in the benchmark), or maybe the impact of an intermediate copy...? – jdehesa May 18 '18 at 13:04
  • Take a look at Numba or Cython. https://stackoverflow.com/a/50387858/4045774 Your mixed solution has eg. two problems. Bad cache and memory behaviour and a unecessary division (can be replaced by a faster multiplication) – max9111 May 20 '18 at 09:07

4 Answers4

2

I think the problem is not really the computation but how the data are organized in memory. If you don't need to keep your original data you can try to minimize the amount of temporary objects in memory by using inplace function :

def mymix(p):

    p[0][1] -= p[0][0] # x1 = x1 - x0
    p[0][2] -= p[0][0] # x2 = x2 - x0
    p[1][1] -= p[1][0] # y1 = y1 - y0
    p[1][2] -= p[1][0] # y2 = y2 - y0

    p[0][1] *= p[1][2] # x1 = (x1-x0) * (y2-y0)
    p[0][2] *= p[1][1] # x2 = (x2-x0) * (y1-y0)

    p[0][1] -= p[0][2] # r = (x1-x0) * (y2-y0) - (x2-x0) * (y1-y0)
    p[0][1] /= 2

    return p[0][1]
TrapII
  • 2,219
  • 1
  • 15
  • 15
2

You can either use Numba or Cython to optimize this types problems. The best way may probably be to calculate the signed area of the triangles directly when you get the corner points from these triangles. Even with a single threaded version your memory bandwidth can be the limiting factor with those few addiditions and multiplications.

Code

import numba as nb
import numpy 

@nb.njit(fastmath=True)
def mix_nb(p):
    assert p.shape[0]==2
    assert p.shape[1]==3

    res=numpy.empty(p.shape[2],dtype=p.dtype)

    for i in range(p.shape[2]):
      res[i]=(+p[0,2,i] * (p[1,0,i] - p[1,1,i])+ p[0,0,i] * (p[1,1,i] - p[1,2,i])+ p[0,1,i] * (p[1,2,i] - p[1,0,i])) /2.

    return res

#Compile
p= np.random.rand(2, 3, 10000)
A=mix_nb(p)

import perfplot

#Benchmark
perfplot.show(
    setup=lambda n: np.random.rand(2, 3, n),
    kernels=[six, mix, mix2, cross, einsum, einsum2, einsum3,mix_nb],
    n_range=[2**k for k in range(19)],
    logx=True,
    logy=True,
    )

Results

Results

max9111
  • 6,272
  • 1
  • 16
  • 33
1

This is a straightforward version in Cython, which I am including here just for completeness sake:

import numpy as np
from libc.stdlib cimport malloc, calloc, realloc, free

def calculate_signed_areas(double[:,:,::1] triangles):
    cdef:
        int i, n
        double area
        double[::1] areas
        double x0, x1, x2
        double y0, y1, y2

    n = triangles.shape[2]
    areas = <double[:n]>malloc(n * sizeof(double))
    for i in range(n):
        x0 = triangles[0,0,i]
        y0 = triangles[1,0,i]
        x1 = triangles[0,1,i]
        y1 = triangles[1,1,i]
        x2 = triangles[0,2,i]
        y2 = triangles[1,2,i]
        area = (
            + x2 * (y0 - y1)
            + x0 * (y1 - y2)
            + x1 * (y2 - y0)
        )/2.0
        areas[i] = area
    return np.asarray(areas)
CodeSurgeon
  • 2,435
  • 2
  • 15
  • 36
1

I somewhat agree with @TrapII's answer: it is an issue of memory usage. However, I believe it is more important how the data are stored in memory. Let's do the following experiment:

In [1]: import numpy as np

In [2]: p = np.random.random((1000,3,2))

In [3]: p2 = np.array(np.moveaxis(p, (1, 0), (0, 2)).reshape((6, 1000)), order='C')

In [4]: def mix(p):
   ...:     return (
   ...:           p[:,2,0] * (p[:,0,1] - p[:,1,1])
   ...:         + p[:,0,0] * (p[:,1,1] - p[:,2,1])
   ...:         + p[:,1,0] * (p[:,2,1] - p[:,0,1])
   ...:         ) / 2.0
   ...:     

In [5]: def cross2(p2):
   ...:     e0x, e0y, e1x, e1y, e2x, e2y = p2
   ...:     return 0.5 * ((e1x - e0x)*(e2y - e0y) - (e1y - e0y)*(e2x - e0x))
   ...: 

In [6]: %timeit mix(p)
18.5 µs ± 351 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit cross2(p2)
9.03 µs ± 90.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
AGN Gazer
  • 8,025
  • 2
  • 27
  • 45
  • 1
    I also tried with the way the data is stored in memory and gives also a performance boost. If you don't want to use other libs than numpy the good answer is definitely a mix of both. – TrapII May 22 '18 at 07:10