I have a 3-dimensional array (np.ndarray) that is mostly zeros. I want to sum it over the first dimension, but this is rather slow. I have looked into csr_matrix, but csr does not support 3-dimensional arrays. Is there a quicker way to sum an almost-sparse ndarray? Below is an excerpt from my current code.
Related question: sparse 3d matrix/array in Python? (builds a home-made sparse ndarray class; overkill?)
import numpy as np

r = np.array([[[1, 0, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 1, 0]],
              [[0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 2, 0]],
              [[0, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]],
              [[0, 0, 0, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]], dtype=int)
np.sum(r, axis=0)
Out[35]:
array([[1, 2, 0, 1],
[1, 0, 0, 1],
[0, 0, 3, 0]])
EDIT
After hpaulj's answer below, I did some more timing tests (see below). It seems that reshaping does not help the sum much, while transforming to csr_matrix and back to numpy kills performance. I am still thinking about using the indices directly (below called rand_persons, rand_articles and rand_days), since in my original problem I also build the big ndarray from these indices.
from timeit import timeit
from scipy.sparse import csr_matrix
import numpy as np

def create_test_data():
    '''
    dtype = int64
    1% nonzero, 1000x1000x100: 1.3 s
    1% nonzero, 10000x1000x100: 13.3 s
    0.1% nonzero, 10000x1000x100: 2.7 s
    1ppm nonzero, 10000x1000x100: 0.007 s
    '''
    global purchases
    N_persons = 10000
    N_articles = 1000
    N_days = 100
    purchases = np.zeros(shape=(N_days, N_persons, N_articles), dtype=int)
    N_elements = N_persons * N_articles * N_days
    # size must be an integer; N_elements / 1e6 is a float
    rand_persons = np.random.choice(a=range(N_persons), size=int(N_elements / 1e6), replace=True)
    rand_articles = np.random.choice(a=range(N_articles), size=int(N_elements / 1e6), replace=True)
    rand_days = np.random.choice(a=range(N_days), size=int(N_elements / 1e6), replace=True)
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        purchases[k, i, j] += 1
def sum_over_first_dim_A():
    '''
    0.1% nonzero, 10000x1000x99: 1.57s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.70s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    np.sum(d, axis=0)
def sum_over_first_dim_B():
    '''
    0.1% nonzero, 10000x1000x99: 1.55s (average over 10)
    1ppm nonzero, 10000x1000x99: 1.37s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape
    d.reshape(N_days, -1).sum(0).reshape(N_persons, N_articles)
def sum_over_first_dim_C():
    '''
    0.1% nonzero, 10000x1000x99: 7.54s (average over 10)
    1ppm nonzero, 10000x1000x99: 7.44s (average over 10)
    '''
    global purchases
    d = purchases[:99, :, :]
    (N_days, N_persons, N_articles) = d.shape
    r = csr_matrix(d.reshape(N_days, -1))
    t = r.sum(axis=0)
    np.reshape(t, newshape=(N_persons, N_articles))
if __name__ == '__main__':
    print(timeit(create_test_data, number=10))
    print(timeit(sum_over_first_dim_A, number=10))
    print(timeit(sum_over_first_dim_B, number=10))
    print(timeit(sum_over_first_dim_C, number=10))
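For what it is worth, here is a minimal sketch of the "use the indices directly" idea mentioned above (the function name and keyword defaults are mine, not part of the timing code): since each purchase is a (person, article, day) triple, summing over the day axis is just counting (person, article) pairs, which np.bincount can do without ever building the dense array.

def sum_over_first_dim_from_indices(rand_persons, rand_articles, rand_days,
                                    N_persons=10000, N_articles=1000, max_day=99):
    # Keep only purchases in the first max_day days.
    mask = rand_days < max_day
    # Flatten each (person, article) pair to a single bin index.
    flat = rand_persons[mask] * N_articles + rand_articles[mask]
    # Count occurrences per bin; this is exactly the sum over the day axis.
    counts = np.bincount(flat, minlength=N_persons * N_articles)
    return counts.reshape(N_persons, N_articles)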
EDIT 2
I have now found a faster way to do the summation: I make a numpy object array holding sparse matrices. However, the initial creation of these matrices still takes some time, and I currently do it with a loop. Is there a way to speed this up?
from scipy.sparse import lil_matrix

def create_test_data():
    [ ... ]
    '''
    0.1% nonzero, 10000x1000x100: 2.1 s
    1ppm nonzero, 10000x1000x100: 0.45 s
    '''
    global sp_purchases
    sp_purchases = np.empty(N_days, dtype=lil_matrix)
    for i in range(N_days):
        sp_purchases[i] = lil_matrix((N_persons, N_articles))
    for (i, j, k) in zip(rand_persons, rand_articles, rand_days):
        sp_purchases[k][i, j] += 1
def sum_over_first_dim_D():
    '''
    0.1% nonzero, 10000x1000x99: 0.47s (average over 10)
    1ppm nonzero, 10000x1000x99: 0.41s (average over 10)
    '''
    global sp_purchases
    d = sp_purchases[:99]
    np.sum(d)
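One possible way to speed up the creation step (just a sketch under my own assumptions; the function name and defaults are not from the code above) is to group the index triples by day and let scipy.sparse.coo_matrix accumulate the duplicates, so Python only loops over the 100 days rather than over every individual purchase:

from scipy.sparse import coo_matrix
import numpy as np

def create_sparse_per_day(rand_persons, rand_articles, rand_days,
                          N_days=100, N_persons=10000, N_articles=1000):
    sp = np.empty(N_days, dtype=object)
    for k in range(N_days):
        sel = rand_days == k
        data = np.ones(sel.sum(), dtype=int)
        # COO format sums duplicate (row, col) entries on conversion to CSR,
        # so the per-element += loop disappears.
        sp[k] = coo_matrix((data, (rand_persons[sel], rand_articles[sel])),
                           shape=(N_persons, N_articles)).tocsr()
    return sp

The resulting object array can be summed over days exactly like sp_purchases above, e.g. np.sum(sp[:99]).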