
I need to perform a summation of the kind i<j on symmetric matrices. This is equivalent to summing over the upper triangular elements of a matrix, diagonal excluded.

Given a symmetric N x N array A, the simplest solution is np.triu(A,1).sum(); however, I was wondering whether faster methods exist that require less memory. It seems that (A.sum() - np.diag(A).sum())/2 is faster on large arrays, but how can I avoid creating even the length-N array from np.diag? A doubly nested for loop would require no additional memory, but it is clearly not the way to go in Python.
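
For reference, a minimal sketch (the test matrix here is purely illustrative) checking that the two expressions above agree on a small symmetric matrix:

import numpy as np

# Small symmetric test matrix with arbitrary values.
a = np.random.randint(-10, 10, size=(5, 5))
A = a + a.T

upper = np.triu(A, 1).sum()                 # explicit sum over i < j
no_diag = (A.sum() - np.diag(A).sum()) / 2  # whole sum minus diagonal, halved

assert upper == no_diag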

linello
  • ``np.diag`` returns only a view of the original array in newer numpy versions, so it does not add an extra memory requirement. – unlut Feb 11 '19 at 11:34
  • From some benchmarks, it appears that np.diag(A).sum() and np.trace(A) indeed have very similar runtimes. – linello Feb 11 '19 at 11:36
  • @linello: You are right, check the timings in my answer below. – Sheldore Feb 11 '19 at 11:45

4 Answers


You can replace np.diag(A).sum() with np.trace(A); this will not create the temporary length-N array.
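
A quick sanity check (the test matrix is purely illustrative, not from the original answer) that the replacement gives the same value:

import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)

# np.trace sums the diagonal directly; same result as np.diag(A).sum().
assert np.trace(A) == np.diag(A).sum()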

BlackBear

Adding my 2 cents to the ideas in the other answers and comments, you might be interested in the following timings for a 1000x1000 symmetric matrix. As you can see, the sum_diag method wins marginally in this case.

import numpy as np

N = 1000
a = np.random.randint(-2000,2000,size=(N,N))
A = (a + a.T)/2

def sum_triu(A):
    return np.triu(A,1).sum()

def sum_diag(A):
    return (A.sum() - np.diag(A).sum())/2

def sum_trace(A):
    return (A.sum() - np.trace(A))/2

%timeit sum_triu(A)
# 3.65 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum_diag(A)
# 663 µs ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sum_trace(A)
# 732 µs ± 120 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sheldore

The fastest method with the least memory, in pure numpy, is going to be to sum the entire thing and subtract the diagonal.

It may feel wasteful in terms of FLOPS, but note that the theoretical savings relative to that implementation are only a factor of 2. If that factor of 2 matters to you, you probably should not be using numpy in the first place.

Also, numpy fundamentally deals with blocks of memory addressable as strided views. If you could get a single strided view onto your triangle, it might lead to an efficient numpy implementation. But you can't (proof left as an exercise to the reader), so you can safely forget about any true numpy solution that isn't a call to an optimized C routine that solves your problem for you. And none that I am aware of exists.

But even such an 'optimized' C loop may in practice get its ass kicked by A.sum(). If A is contiguous, that sum has the potential to dispatch a maximally cache- and SIMD-optimized code path. Likely, any vanilla C you'd write yourself would get absolutely demolished by A.sum() in a benchmark.
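
To make the comparison concrete, here is a minimal sketch (not part of the original answer; it assumes the 1000x1000 symmetric matrix A from the benchmark answer above) contrasting the no-extra-memory pure-Python loop with the vectorized sum-and-subtract approach:

import numpy as np

def sum_triu_loop(A):
    # Doubly nested loop over i < j: no temporary arrays, but every
    # element access goes through the Python interpreter.
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += A[i, j]
    return total

def sum_trace(A):
    # Whole-array sum minus the trace, halved: two optimized C loops.
    return (A.sum() - np.trace(A)) / 2

# %timeit sum_triu_loop(A)  # expect this to be orders of magnitude slower
# %timeit sum_trace(A)      # than the vectorized version on large A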

Eelco Hoogendoorn

You could use Einstein notation to sum over the diagonal: np.einsum('ii', a) is equivalent to np.diag(a).sum(). For benchmarking purposes:

import numpy as np
a = np.arange(25).reshape(5, 5)
%timeit np.einsum('ii', a)
# 1.72 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.diag(a).sum()
# 3.93 µs ± 29.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
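
For the original problem, the same einsum trick can stand in for the trace or diagonal term. A minimal sketch (not part of the original answer; it reuses the symmetric matrix A from the question):

import numpy as np

def sum_triu_einsum(A):
    # Whole-array sum minus the einsum-computed trace, halved:
    # the strict upper-triangle sum of a symmetric A.
    return (A.sum() - np.einsum('ii', A)) / 2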
Dani Mesejo