2

I have a 3 dimensional array (np.ndarray) which has mostly 0 in it. Now I want to sum them over the first dimension, but this is rather slow. I have looked into csr_matrix, but csr does not support 3 dimensional arrays. Is there a quicker way to sum an almost sparse nd array? Below is an excerpt from my current code.

Related question: sparse 3d matrix/array in Python? (creates a home made sparse ndarray Class, overkill?)

r = np.array([  [[1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 0]],

                [[0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 2, 0]],

                [[0, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]],

                [[0, 0, 0, 1],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]]], dtype=int)
np.sum(r,axis=0)
Out[35]: 
array([[1, 2, 0, 1],
       [1, 0, 0, 1],
       [0, 0, 3, 0]])

EDIT

After hpaulj's answer below, I did some more timing tests, see below. It seems that reshaping does not do a lot of good for the sum, while transforming them into csr_matrix and back to numpy kills performance. I am still thinking about using the indices directly (below called rand_persons, rand_articles and rand_days, since also in my original problem, I make the big ndarray using these indices.

from timeit import timeit
from scipy.sparse import csr_matrix
import numpy as np

def create_test_data():
    '''
    dtype = int64
    1% nonzero, 1000x1000x100: 1.3 s, 
    1% nonzero, 10000x1000x100: 13.3 s
    0.1% nonzero, 10000x1000x100: 2.7 s
    1ppm nonzero, 10000x1000x100: 0.007 s
    '''
    global purchases
    N_persons = 10000
    N_articles = 1000
    N_days = 100
    purchases = np.zeros(shape=(N_days, N_persons, N_articles), dtype=int)
    N_elements = N_persons * N_articles * N_days
    rand_persons = np.random.choice(a=range(N_persons), size=N_elements / 1e6, replace=True)
    rand_articles = np.random.choice(a=range(N_articles), size=N_elements / 1e6, replace=True)
    rand_days = np.random.choice(a=range(N_days), size=N_elements / 1e6, replace=True)
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        purchases[k, i, j] += 1

def sum_over_first_dim_A():
    '''
    0.1% nonzero, 10000x1000x99: 1.57s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.70s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    np.sum(d, axis=0)
def sum_over_first_dim_B():
    '''
    0.1% nonzero, 10000x1000x99: 1.55s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.37s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape 
    d.reshape(N_days, -1).sum(0).reshape(N_persons, N_articles) 
def sum_over_first_dim_C():
    '''
    0.1% nonzero, 10000x1000x99: 7.54s (average over 10)
    1ppm nonzero, 10000x1000x99: 7.44s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape 
    r = csr_matrix(d.reshape(N_days, -1))
    t = r.sum(axis=0)
    np.reshape(t, newshape=(N_persons, N_articles))

if __name__ == '__main__':
    print (timeit(create_test_data, number=10))
    print (timeit(sum_over_first_dim_A, number=10))
    print (timeit(sum_over_first_dim_B, number=10))
    print (timeit(sum_over_first_dim_C, number=10))

EDIT 2

I now found a faster way to do the summation: I make a numpy array with sparse matrices. However, there is still some time in the initial creation of these matrices. I now do this with a loop. Is there a way to speed this up?

def create_test_data():
    [ ... ]
    '''
    0.1% nonzero, 10000x1000x100: 2.1 s
    1ppm nonzero, 10000x1000x100: 0.45 s
    '''
    global sp_purchases
    sp_purchases = np.empty(N_days, dtype=lil_matrix)
    for i in range(N_days):
        sp_purchases[i] = lil_matrix((N_persons, N_articles))
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        sp_purchases[k][i, j] += 1

def sum_over_first_dim_D():
    '''
    0.1% nonzero, 10000x1000x99: 0.47s (average over 10)
    1ppm nonzero, 10000x1000x99: 0.41s (average over 10)
    '''
    global sp_purchases
    d = sp_purchases[:99]
    np.sum(d)
Community
  • 1
  • 1
physicalattraction
  • 6,485
  • 10
  • 63
  • 122
  • Numpy [masked arrays](http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html) could help, though they're not an efficient way to store sparse matrices, so depending on how large and sparse your array is, it might not be useful. What is the starting format of the data? If it's already in an ndarray, then I think converting it to a sparse format and summing probably won't be much faster. – hunse Feb 17 '15 at 16:56
  • The original starting data structure is three arrays of (i,j,k) indices and one equally long array of values. I made a numpy ndarray from this, which is fast, since it is sparse (~1 in 1,000,000 elements is nonzero). – physicalattraction Feb 18 '15 at 08:02

2 Answers2

1

You could reshape the array so it is 2d, do the sum, and then shape back

r.reshape(4,-1).sum(0).reshape(3,4)   # == r.sum(0)

That reshaping does not add much processing time. And you can convert that 2d to sparse, and see if that saves any time. My guess is that your array will have to be very large, and very sparse, to beat the straight numpy sum. If you have other reasons to use the sparse format it may be worth it, but simply to do this sum, no. But test it yourself.

hpaulj
  • 221,503
  • 14
  • 230
  • 353
0

Since your data is already in sparse format (indices and values), you can do the sum yourself. Just create an array that is the size of the final summed array, and loop over the indices, summing the corresponding values into the right slots. The sum2d function below shows how you would do it given that you're summing over the first dimension:

import timeit
import numpy as np

n = 1000
s = 1000
inds = np.random.randint(0, n, size=(s, 3))
vals = np.random.normal(size=s)


def sum3d():
    a = np.zeros((n, n, n))
    for [i, j, k], v in zip(inds, vals):
        a[i, j, k] = v

    return a.sum(axis=0)


def sum2d():
    b = np.zeros((n, n))
    for [i, j, k], v in zip(inds, vals):
        b[j, k] += v

    return b


kwargs = dict(repeat=3, number=1)
print(min(timeit.repeat('sum3d()', 'from __main__ import sum3d', **kwargs)))
print(min(timeit.repeat('sum2d()', 'from __main__ import sum2d', **kwargs)))
assert np.allclose(sum3d(), sum2d())
hunse
  • 3,175
  • 20
  • 25