
Probably a commonplace question, but how can I parallelize this loop in Python?

Ku = np.zeros((Nx.shape[2], Nx.shape[2]))  # one entry per pair of 2d slices
for i in range(Nx.shape[2]):
  for j in range(Nx.shape[2]):
    NI = Nx[:, :, i]; NJ = Nx[:, :, j]
    Ku[i, j] = (NI[~mask] * NJ[~mask]).sum()

So my question: what's the easiest way to parallelize this code?

EDIT:

An example of the data:

import numpy as np

#my input
Nx = np.random.rand(5,5,5)  

#mask creation
mask_positions = list(zip(*np.where(Nx[:,:,0] < 0.4)))  # list() so it works on Python 3
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:,:,0].shape, bool)
mask[i,j] = True

And I want to calculate Ku in parallel. My aim is to use the Ku array to solve a linear system, so I have to leave the masked values out (they represent nearly half of my array).
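For what it's worth, a minimal sketch of the literal parallelization with multiprocessing.Pool (as suggested in the comments below) could look like this; the helper name ku_entry is made up, and the sketch assumes either a fork-based start method or the seeding shown, so that every worker process sees the same Nx and mask:

import numpy as np
from multiprocessing import Pool
from itertools import product

np.random.seed(0)  # so workers that re-import this module build the same data
Nx = np.random.rand(5, 5, 5)
mask = Nx[:, :, 0] < 0.4

def ku_entry(ij):
    # one (i, j) entry: sum of products over the unmasked positions
    i, j = ij
    return (Nx[~mask, i] * Nx[~mask, j]).sum()

if __name__ == '__main__':
    n = Nx.shape[2]
    with Pool() as pool:
        flat = pool.map(ku_entry, product(range(n), range(n)))
    Ku = np.asarray(flat).reshape(n, n)

For a problem this small the process start-up and pickling overhead will dominate, which is part of why the answers below suggest vectorizing instead.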

user3601754
  • Please provide some sample data and expected output. – Ashwini Chaudhary Feb 27 '15 at 19:15
  • [multiprocessing.Pool](https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers) is a good starting place. – Brian Cain Feb 27 '15 at 21:20
  • By the way, outside of implementing @hpaulj's excellent answer, you might want to consider that your mask, which you fabricate in 5 lines, is equivalent to just writing `mask = Nx[:,:,0] < 0.4`. – Oliver W. Feb 27 '15 at 21:38

2 Answers


I think you want to 'vectorize', to use numpy terminology, not parallelize in the multiprocessing sense.

Your calculation is essentially a dot (matrix) product. Apply the mask once to the whole array to get a 2d array, NIJ. Its shape will be (N,5), where N is the number of True values in ~mask. Then it's just a (5,N) array 'dotted' with an (N,5) one, i.e. a sum over the N dimension, leaving you with a (5,5) array.

NIJ = Nx[~mask,:]
Ku = np.dot(NIJ.T,NIJ)

In quick tests it matches the Ku produced by your double loop. Depending on the underlying library used for np.dot there might be some multicore calculation, but that's usually not a priority issue for numpy users.
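A minimal sketch of that check, assuming Nx and mask are defined as in the question above:

n = Nx.shape[2]
Ku_loop = np.zeros((n, n))  # rebuild Ku with the question's double loop
for i in range(n):
    for j in range(n):
        NI = Nx[:, :, i]; NJ = Nx[:, :, j]
        Ku_loop[i, j] = (NI[~mask] * NJ[~mask]).sum()

NIJ = Nx[~mask, :]
print(np.allclose(Ku_loop, np.dot(NIJ.T, NIJ)))  # expect True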


Applying the large boolean mask is the most time-consuming part of these calculations, for both the vectorized and iterative versions.

For a mask with 400,000 True values, compare these two indexing times:

In [195]: timeit (NI[:400,:1000],NJ[:400,:1000])
100000 loops, best of 3: 4.87 us per loop
In [196]: timeit (NI[mask],NJ[mask])
10 loops, best of 3: 98.8 ms per loop

Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the mask.
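The arrays behind these timings aren't shown in full; a plausible setup, with assumed shapes, would be something like:

import numpy as np

# assumed shapes, chosen so the mask selects roughly 400,000 elements
NI = np.random.rand(1400, 1528)
NJ = np.random.rand(1400, 1528)
mask = np.random.rand(1400, 1528) < 0.19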

Substituting np.dot(NI[mask],NJ[mask]) for (NI[mask]*NJ[mask]).sum() only saves a few ms.

hpaulj
  • I have to do this on large arrays, a great number of times, and then I solve a system with these arrays with `np.linalg.solve`. I would like to accelerate it... – user3601754 Feb 27 '15 at 21:37
  • @user3601754 did you *try* this code? It's fast. Perform some timing comparisons if you need convincing. I know the history of your questions and you should definitely consider this solution before attempting to do multiprocessing. – Oliver W. Feb 27 '15 at 21:40
  • 13.237869 secondes for my case and i m waiting the results since 2mins for this code – user3601754 Feb 27 '15 at 21:58
  • What size of arrays are you working with? And which takes more time, applying the mask or doing the `dot`? With large enough ones there might be a trade-off between memory use and vectorization. – hpaulj Feb 27 '15 at 22:26
  • My Nx array has shape (1400, 1528, 20) and I have to repeat the calculation many times! It seems to be the mask step – user3601754 Feb 27 '15 at 23:27
  • With that shape your iterative approach might well be faster. Making an array of that size and applying a mask produces an error (on my old linux machine): `ValueError: dimensions too large in fancy indexing`. Your `Ku` is only `(20,20)`, small compared to the other arrays. – hpaulj Feb 27 '15 at 23:38
  • I'd challenge that notion. I've done [the timings](http://pastebin.com/HdFvbzMG) on my Raspberry Pi (rev 2, model B), which has 512MB RAM and a single-core 700MHz CPU: getting a ValueError does not mean vectorization is slower than the double for-loop; that depends entirely on the dataset and RAM. I'm hoping the OP will shed some light on how he does the timing tests, because I have the feeling they are not as we expect. @user3601754 can you explain *how* you time the code and what the output is when you run the code from the pastebin? – Oliver W. Feb 28 '15 at 01:18
  • I changed my `mask` so it only had 400,000 True values, and calculated `Ku` with my `dot` method - 600ms per loop. – hpaulj Feb 28 '15 at 05:53
  • For the same data, the OP loop method took 85 sec. – hpaulj Feb 28 '15 at 06:06
  • I got the time with `tic = time.clock()` before the code and `toc = time.clock()` after, and then `print 'Temps=', toc-tic`. I'm using this Ku array to solve a linear system (but that step is very fast) – user3601754 Feb 28 '15 at 09:17
  • So, to go back to my first question: can we parallelize this loop in order to be faster? – user3601754 Feb 28 '15 at 09:19
  • @user3601754 I see you've finally accepted one of the answers. I'm assuming that means you are getting time improvements by running this code? Care to enlighten us why the timings you mentioned (your code: 13s vs vectorized-dot: 291s) were not at all what we expected? – Oliver W. Feb 28 '15 at 17:38
  • I added some timings to show that indexing with `mask` is the most time consuming part of these calculations. – hpaulj Feb 28 '15 at 18:50

I'd like to extend @hpaulj's excellent answer (great analysis of the problem too by the way) for large matrices.

The operation

Ku = np.dot(NIJ.T,NIJ)

can be replaced by

Ku = np.einsum('ij,ik->jk', NIJ, NIJ)

It should also be noted that np.dot could fall back to slower routines if numpy was not compiled to use BLAS.
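A quick way to check is np.show_config(), which prints the BLAS/LAPACK libraries your NumPy build is linked against:

import numpy as np

np.show_config()  # prints BLAS/LAPACK build information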

For a test matrix NIJ of shape (1250711, 50), I got 54.9 s with the dot method, while the einsum does it in 1.67 s. On my system, numpy is compiled with BLAS support.

Remark: np.einsum does not always outperform np.dot, a situation that became apparent on my system when comparing any of the following:

Nx = np.random.rand(1400,1528,20).astype(np.float16)
Nx = np.random.rand(1400,1528,20).astype(np.float32)

(or even a dtype of np.float64).

Oliver W.
  • Thanks for your help! If I use your code, I get 263.650027 sec :) For the moment, my code is better (in my case)! – user3601754 Feb 27 '15 at 23:21
  • @user3601754, you should consider hpaulj's comment: without more information about the size of the arrays you're working on (and possibly the size of the RAM), it's hard to improve on this algorithm any further. Just FYI, using hpaulj's algorithm (without the `einsum`), on a matrix of `Nx = np.random.rand(500,5000,50)`, I saw a threefold speed improvement compared to your method. But such speed gains depend a lot on the dimensions of the arrays involved (is the last axis the biggest one?) – Oliver W. Feb 27 '15 at 23:29
  • Ok! My shape is (1400, 1528, 20) – user3601754 Feb 27 '15 at 23:33
  • and my dtype is float64 – user3601754 Feb 28 '15 at 11:29