Python Numpy vectorize nested for-loops for combinatorics

Question

Given an nxn array A of real positive numbers, I'm trying to find the minimum of the maximum of the element-wise minimum of all combinations of three rows of the 2-d array. Using for-loops, that comes out to something like this:

import numpy as np

n = 100
np.random.seed(2)
A = np.random.rand(n,n)
global_best = np.inf

for i in range(n-2):
    for j in range(i+1, n-1):
        for k in range(j+1, n):
            # find the maximum of the element-wise minimum of the three vectors
            local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
            # if local_best is lower than global_best, update global_best
            if (local_best < global_best):
                global_best = local_best
                save_rows = [i, j, k]

print global_best, save_rows

In the case for n = 100, the output should be this:

Out[]: 0.492652949593 [6, 41, 58]

I have a feeling though that I could do this much faster using Numpy vectorization, and would certainly appreciate any help on doing this. Thanks.

Are you searching for something like that? https://stackoverflow.com/a/49821744/2504700 It is not exectly the same, but getting combinations of 3 instead of 2 will probably do the trick — OriolAbril, Apr 24 '18 at 03:54
It would help if you could give a small example input and output. — John Zwinck, Apr 24 '18 at 07:29
@JohnZwinck updated to a working stand-alone script. You can vary the value for n to test larger sized problems — ToneDaBass, Apr 24 '18 at 07:59

John Zwinck · Answer 1 · 2018-04-24T10:25:30.877

6

This solution is 5x faster for n=100:

coms = np.fromiter(itertools.combinations(np.arange(n), 3), 'i,i,i').view(('i', 3))
best = A[coms].min(1).max(1)
at = best.argmin()
global_best = best[at]
save_rows = coms[at]

The first line is a bit convoluted but turns the result of itertools.combinations into a NumPy array which contains all possible [i,j,k] index combinations.

From there, it's a simple matter of indexing into A using all the possible index combinations, then reducing along the appropriate axes.

This solution consumes a lot more memory as it builds the concrete array of all possible combinations A[coms]. It saves time for smallish n, say under 250, but for large n the memory traffic will be very high and it may be slower than the original code.

edited Apr 24 '18 at 10:25

answered Apr 24 '18 at 10:20

John Zwinck

239,568
38
324
436

Very interesting. Indeed, this is along the train of thought I was originally going for, but the memory issues do become a real problem with scale. I tried n=430 and the code crashed. It's a great example of how eventually memory can dominate algorithmic efficiency as one increases the problem size. – ToneDaBass Apr 24 '18 at 18:05
Before reading this answer I didn't know about the existence of `np.fromiter` now I'm a huge fan already – OriolAbril Apr 24 '18 at 18:22

OriolAbril · Answer 2 · 2018-04-24T21:53:55.297

Working by chunks allows to combine the speed of vectorized calculus while avoiding to run into Memory Errors. Below there is an example of converting the nested loops to vectorization by chunks.

Starting from the same variables as the question, a chunk length is defined, in order to vectorize calculations inside the chunk and loop only over chunks instead of over combinations.

chunk = 2000 # define chunk length, if to small, the code won't take advantage 
             # of vectorization, if it is too large, excessive memory usage will 
             # slow down execution, or Memory Error will be risen 
combinations = itertools.combinations(range(n),3) # generate iterator containing 
                                        # all possible combinations of 3 columns
N = n*(n-1)*(n-2)//6 # number of combinations (length of combinations cannot be 
                     # retrieved because it is an iterator)
# generate a list containing how many elements of combinations will be retrieved 
# per iteration
n_chunks, remainder = divmod(N,chunk)
counts_list = [chunk for _ in range(n_chunks)]
if remainder:
    counts_list.append(remainder)

# Iterate one chunk at a time, using vectorized code to treat the chunk
for counts in counts_list:
    # retrieve combinations in current chunk
    current_comb = np.fromiter(combinations,dtype='i,i,i',count=counts)\
                     .view(('i',3)) 
    # maximum of element-wise minimum in current chunk
    chunk_best = np.minimum(np.minimum(A[current_comb[:,0],:],A[current_comb[:,1],:]),
                            A[current_comb[:,2],:]).max(axis=1) 
    ravel_save_row = chunk_best.argmin() # minimum of maximums in current chunk
    # check if current chunk contains global minimum
    if chunk_best[ravel_save_row] < global_best: 
        global_best = chunk_best[ravel_save_row]
        save_rows = current_comb[ravel_save_row]
print(global_best,save_rows)

I ran some performance comparisons with the nested loops, obtaining the following results (chunk_length = 1000):

n=100
- Nested loops: 1.13 s ± 16.6 ms
- Work by chunks: 108 ms ± 565 µs
n=150
- Nested loops: 4.16 s ± 39.3 ms
- Work by chunks: 523 ms ± 4.75 ms
n=500
- Nested loops: 3min 18s ± 3.21 s
- Work by chunks: 1min 12s ± 1.6 s

Note

After profiling the code, I found that the np.min was what took longest by calling np.maximum.reduce. I converted it directly to np.maximum which improved performance a bit.

This is a great way to get the benefits of fast computation while avoiding the memory issues, I'm learning so much from this thread! — ToneDaBass, Apr 24 '18 at 21:43

Juan · Answer 3 · 2018-04-24T08:25:28.260

2

You can use combinations from itertools, that it's a python standard library, and it will help you to to remove all those nested loops.

from itertools import combinations
import numpy as np

n = 100
np.random.seed(2)
A = np.random.rand(n,n)
global_best = 1000000000000000.0

for i, j, k in combinations(range(n), 3):
    local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
    if local_best < global_best:
        global_best = local_best
        save_rows = [i, j, k]

print global_best, save_rows

edited Apr 24 '18 at 08:25

answered Apr 24 '18 at 08:20

Juan

1,520
2
19
31

1

True, this does make for much better looking code. But it runs at just about the same speed as the three nested loops. I'm wondering if there are any approaches that would run significantly faster. – ToneDaBass Apr 24 '18 at 08:35

max9111 · Accepted Answer · 2019-02-28T17:08:03.583

Don't try to vectorize loops that are not simple to vectorize. Instead use a jit compiler like Numba or use Cython. Vectorized solutions are good if the resulting code is more readable, but in terms of performance a compiled solution is usually faster or in a worst case scenario as fast as a vectorized solution (except BLAS routines).

Single-threaded example

import numba as nb
import numpy as np

#Min and max library calls may be costly for only 3 values
@nb.njit()
def max_min_3(A,B,C):
  max_of_min=-np.inf
  for i in range(A.shape[0]):
    loc_min=A[i]
    if (B[i]<loc_min):
      loc_min=B[i]
    if (C[i]<loc_min):
      loc_min=C[i]

    if (max_of_min<loc_min):
      max_of_min=loc_min

  return max_of_min

@nb.njit()
def your_func(A):
  n=A.shape[0]
  save_rows=np.zeros(3,dtype=np.uint64)
  global_best=np.inf
  for i in range(n):
      for j in range(i+1, n):
          for k in range(j+1, n):
              # find the maximum of the element-wise minimum of the three vectors
              local_best = max_min_3(A[i,:], A[j,:], A[k,:])
              # if local_best is lower than global_best, update global_best
              if (local_best < global_best):
                  global_best = local_best
                  save_rows[0] = i
                  save_rows[1] = j
                  save_rows[2] = k

  return global_best, save_rows

Performance of single-threaded version

n=100
your_version: 1.56s
compiled_version: 0.0168s (92x speedup)

n=150
your_version: 5.41s
compiled_version: 0.08122s (66x speedup)

n=500
your_version: 283s
compiled_version: 8.86s (31x speedup)

The first call has a constant overhead of about 0.3-1s. For performance measurement of the calculation time itself, call it once and then measure performance.

With a few code changes this task can also be parallelized.

Multi-threaded example

@nb.njit(parallel=True)
def your_func(A):
  n=A.shape[0]
  all_global_best=np.inf
  rows=np.empty((3),dtype=np.uint64)

  save_rows=np.empty((n,3),dtype=np.uint64)
  global_best_Temp=np.empty((n),dtype=A.dtype)
  global_best_Temp[:]=np.inf

  for i in range(n):
      for j in nb.prange(i+1, n):
          row_1=0
          row_2=0
          row_3=0
          global_best=np.inf
          for k in range(j+1, n):
              # find the maximum of the element-wise minimum of the three vectors

              local_best = max_min_3(A[i,:], A[j,:], A[k,:])
              # if local_best is lower than global_best, update global_best
              if (local_best < global_best):
                  global_best = local_best
                  row_1 = i
                  row_2 = j
                  row_3 = k

          save_rows[j,0]=row_1
          save_rows[j,1]=row_2
          save_rows[j,2]=row_3
          global_best_Temp[j]=global_best

      ind=np.argmin(global_best_Temp)
      if (global_best_Temp[ind]<all_global_best):
          rows[0] = save_rows[ind,0]
          rows[1] = save_rows[ind,1]
          rows[2] = save_rows[ind,2]
          all_global_best=global_best_Temp[ind]

  return all_global_best, rows

Performance of multi-threaded version

n=100
your_version: 1.56s
compiled_version: 0.0078s (200x speedup)

n=150
your_version: 5.41s
compiled_version: 0.0282s (191x speedup)

n=500
your_version: 283s
compiled_version: 2.95s (96x speedup)

Edit

In a newer Numba Version (installed through the Anaconda Python Distribution) I have to manually install tbb to get a working parallelization.

This is really great. While the overhead makes it slower compared to the "chunk" answer by @xg.plt.py for small problems, this penalty is minute compared to the massive benefits for large problems. I'm learning so much from this thread, thanks for the insight. — ToneDaBass, Apr 27 '18 at 05:43
Later versions of numba now seem to give errors with your parallel approach, do you know what the fix is? — ToneDaBass, Feb 27 '19 at 19:59
@ToneDaBass I got the same behaivour an a newer Numba version. Installing tbb threading backend should solve the problem. — max9111, Feb 28 '19 at 12:17
The problem seems to be due to the `save_rows[j,0]=row_1` part of the code, resulting in the following error `Cannot resolve setitem: array(uint64, 2d, C)[(int64, Literal[int](0))] = array(int64, 1d, C)` — ToneDaBass, Mar 07 '19 at 17:05
@ToneDaBass I also edited the code a bit (Inserting row_1=0,row_2=0,row_3=0) It looks like that without this edit Numba doesn't recognize the thread-local varibales row_1,row_2,row_3 — max9111, Mar 08 '19 at 08:11
I forgot to comment earlier thanking you for posting your modifications, they solved the issue perfectly. Thank you! — ToneDaBass, Apr 05 '19 at 17:51
I updated to the latest numba, and now I get the following error when I run the parallel code `Variable all_global_best.2 used in parallel loop may be written to simultaneously by multiple workers and may result in non-deterministic or unintended results.` — ToneDaBass, Apr 23 '20 at 01:19

Python Numpy vectorize nested for-loops for combinatorics

4 Answers4

Note

Linked