I'd like to know how I might be able to transform this problem to reduce the overhead of the `np.sum()` function calls in my code.
I have an input matrix, say of `shape=(1000, 36)`. Each row represents a node in a graph. I have an operation that I am doing, which is iterating over each row and doing an element-wise addition with a variable number of other rows. Those "other" rows are defined in a dictionary `nodes_nbrs` that records, for each row, a list of rows that must be summed together. An example is as such:
```python
nodes_nbrs = {0: [0, 1],
              1: [1, 0, 2],
              2: [2, 1],
              ...}
```
Here, node `0` would be transformed into the sum of nodes `0` and `1`. Node `1` would be transformed into the sum of nodes `1`, `0`, and `2`. And so on for the rest of the nodes.
The naive way I currently have this implemented is as follows: I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the `nodes_nbrs` dictionary:
```python
import numpy as np

output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    # Fancy-index the neighbor rows and collapse them into row k.
    output[k] = np.sum(input[v], axis=0)
```
This code is all cool and fine in small tests (`shape=(1000, 36)`), but on larger tests (on the order of `shape=(1e5, 36)` to `shape=(1e6, 36)`), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.
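For anyone who wants to reproduce the timing, a minimal synthetic setup along these lines should do; the sizes and random neighborhood structure here are made up, not my real data:

```python
import numpy as np

n_nodes = 100_000                       # hypothetical "large test" size
input = np.random.random((n_nodes, 36))

# Hypothetical neighborhoods: each node sums itself plus 3 random rows,
# mimicking the structure of the real nodes_nbrs dictionary.
nodes_nbrs = {k: [k] + np.random.randint(0, n_nodes, size=3).tolist()
              for k in range(n_nodes)}
```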
After doing line profiling, I noticed that the key killer here is calling the `np.sum` function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
Apart from that, here is a list of things I have done, and (very briefly) their results:
- A `cython` version: eliminates the `for` loop type-checking overhead, for a 30% reduction in time taken. With the `cython` version, `np.sum` takes about 80% of the total wall clock time, rather than 50%.
- Pre-declare `np.sum` as a variable `npsum`, and then call `npsum` inside the `for` loop. No difference from the original.
- Replace `np.sum` with `np.add.reduce`, assign that to the variable `npsum`, and then call `npsum` inside the `for` loop (sketch after this list). ~10% reduction in wall clock time, but then incompatible with `autograd` (explanation below in the sparse matrices bullet point).
- `numba` JIT-ing: did not attempt more than adding the decorator (sketch after this list). No improvement, but I didn't try harder.
- Convert the `nodes_nbrs` dictionary into a dense `numpy` binary array (1s and 0s), and then do a single `np.dot` operation (sketch after this list). Good in theory, bad in practice because it would require a square matrix of `shape=(10^n, 10^n)`, which is quadratic in memory usage.
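For reference, the `npsum` variants above amount to nothing more than rebinding the callable; a minimal sketch (the loop is otherwise identical to the naive version):

```python
npsum = np.add.reduce  # or np.sum; binding either to a local name

output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = npsum(input[v], axis=0)
```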
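The `numba` attempt was literally just the decorator on the loop, roughly as below; since the function takes a plain Python dict, I assume it fell back to object mode, which would explain the lack of improvement:

```python
import numba
import numpy as np

@numba.jit  # no signature, no nopython=True; presumably object-mode fallback
def neighbor_sums(input, nodes_nbrs, output):
    for k, v in nodes_nbrs.items():
        output[k] = np.sum(input[v], axis=0)
    return output
```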
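And the dense `np.dot` formulation, for concreteness (a sketch; `adj` is the dense 1s-and-0s matrix built from `nodes_nbrs`), which makes the quadratic memory cost obvious:

```python
n = input.shape[0]
adj = np.zeros((n, n))       # dense (n, n) matrix: quadratic in memory
for k, v in nodes_nbrs.items():
    adj[k, v] = 1.0          # row k marks which rows get summed into node k

output = np.dot(adj, input)  # one matmul replaces all the np.sum calls
```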
Things I have not tried, but am hesitant to:

- `scipy` sparse matrices: I am using `autograd`, which does not support automatic differentiation of the `dot` operation for `scipy` sparse matrices (sketch below).
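If `autograd` support weren't the blocker, I imagine it would look something like this sparse analogue of the dense `np.dot` approach (a sketch, not tested against my real data):

```python
import numpy as np
import scipy.sparse as sp

# Flatten the dict into COO-style (row, col) index pairs.
rows = [k for k, v in nodes_nbrs.items() for _ in v]
cols = [c for v in nodes_nbrs.values() for c in v]
data = np.ones(len(rows))

n = input.shape[0]
adj = sp.csr_matrix((data, (rows, cols)), shape=(n, n))

output = adj.dot(input)  # fast and memory-light, but opaque to autograd
```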
For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.