
Consider the following code:

n = 20000


def f(i, j):

    return (i+1j*j)/(i-1j*j+1)  # a sample function, not necessarily of this form


lst = []
for i in range(n):
    for j in range(i, n):
        lst.append((i, j, f(i, j)))

Since the loop is very large, I want to vectorize it or otherwise speed it up. Reading other posts, it seems that itertools.product can speed up loops, but in my case the second loop depends on the first, so it seems I can't simply use it. How can I speed this up?

I can, for example, use 4 processors.

an offer can't refuse
  • `itertools.product` doesn't really speed up the loop by much. It provides a convenient way to write a nested loop as a single one. If you cannot vectorize your function, your best bet is something like `numba`. – Quang Hoang Feb 24 '21 at 14:36
  • One suggestion: don't grow the list dynamically. See: https://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower – Evan Rosica Feb 24 '21 at 14:36
  • By the way, `itertools.combinations` is somewhat equivalent to your double loop, not `itertools.product`. – Quang Hoang Feb 24 '21 at 14:38
  • @QuangHoang How to vectorize a function in general? – an offer can't refuse Feb 24 '21 at 14:39
  • @anoffercan'trefuse That's too hard of a question. Most people would say **No way**. You have to know what your function does. Even then, vectorization might not be trivial, if at all possible. – Quang Hoang Feb 24 '21 at 14:43
  • @QuangHoang Thanks for mentioning. I noticed combinations, which speeds things up a little bit but not much. – an offer can't refuse Feb 24 '21 at 14:43

5 Answers


In general, you can do it with np.triu_indices:

i, j = np.triu_indices(n)
np.stack([i, j, f(i, j)])

This might choke your system (since i and j will each have about 200M elements for n = 20000), in which case you'll need itertools.
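If holding all the index pairs at once is the problem, one middle ground (my sketch, not part of this answer; the name triu_chunks is mine) is to stream the upper-triangle pairs with itertools.combinations_with_replacement, which yields exactly the i <= j pairs of the original loop, and process them in fixed-size chunks:

```python
import itertools
import numpy as np

def f(i, j):
    # sample function from the question; works elementwise on arrays too
    return (i + 1j * j) / (i - 1j * j + 1)

def triu_chunks(n, chunk_size):
    """Yield stacked (i, j, f(i, j)) arrays one chunk at a time,
    never materializing all n*(n+1)/2 index pairs at once."""
    pairs = itertools.combinations_with_replacement(range(n), 2)  # i <= j, like the loop
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        i, j = np.array(chunk).T
        yield np.stack([i, j, f(i, j)])  # shape (3, len(chunk)), complex dtype

# process n = 100 in blocks of 1000 pairs
total = sum(chunk.shape[1] for chunk in triu_chunks(100, 1000))
```

Each chunk still gets the vectorized treatment, so the per-pair Python overhead is amortized while memory stays bounded by chunk_size.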

Daniel F

You can use:

import itertools

lst = list(map(
    lambda item: (item[0], item[1], f(item[0], item[1])),
    filter(lambda item: item[1] >= item[0], itertools.product(range(n), range(n)))
))

(Note that the filter must keep pairs with item[1] >= item[0], matching j >= i in the original loop.) However, I do not see how this can speed up the loop, since itertools.product itself iterates over a nested loop.

napuzba

Suggestions:

  1. Don't grow the list dynamically. See: Is there a way to circumvent Python list.append() becoming progressively slower in a loop as the list grows?, and Create an empty list in python with certain size.
  2. Instead of a list, allocate a blank numpy array of the desired size, and fill it using your function.
  3. Divide the input range into subsets, and create subprocesses to process each subset. For instance, process 1 handles 0-10000, and process 2 handles 10001-20000. The main process waits for the child processes to exit, then combines the results.
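A minimal sketch of suggestion 3 (the names rows and parallel_lst are mine, not from the answer, and 4 workers is just the processor count mentioned in the question):

```python
import numpy as np
from multiprocessing import Pool

def f(i, j):
    # sample function from the question
    return (i + 1j * j) / (i - 1j * j + 1)

def rows(args):
    # worker: handle a contiguous block of i values (must be a top-level
    # function so multiprocessing can pickle it)
    start, stop, n = args
    return [(i, j, f(i, j)) for i in range(start, stop) for j in range(i, n)]

def parallel_lst(n, workers=4):
    # split 0..n into `workers` contiguous blocks of i values
    bounds = np.linspace(0, n, workers + 1, dtype=int)
    tasks = [(bounds[k], bounds[k + 1], n) for k in range(workers)]
    with Pool(workers) as pool:          # on Windows/macOS, call this under
        parts = pool.map(rows, tasks)    # an `if __name__ == "__main__":` guard
    return [item for part in parts for item in part]
```

One caveat: early i-blocks contain more (i, j) pairs than later ones, so an even split of i is not an even split of work; interleaving rows across workers would balance the load better.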
Evan Rosica

I'll present an example for a smaller value of n:

n = 4

First create a full array with results of your function:

arr = np.fromfunction(f, (n,n), dtype='complex')

So far the result (with reduced precision) is:

array([[ 0.     +0.j, -0.5    +0.5j    , -0.8    +0.4j    , -0.9    +0.3j    ],
       [ 0.5    +0.j,  0.2    +0.6j    , -0.25   +0.75j   , -0.53846+0.69231j],
       [ 0.66667+0.j,  0.5    +0.5j    ,  0.15385+0.76923j, -0.16667+0.83333j],
       [ 0.75   +0.j,  0.64706+0.41176j,  0.4    +0.7j    ,  0.12   +0.84j   ]])

Then, to generate your expected result, run:

result = [ [i,j, arr[i,j]] for i, j in zip(*np.triu_indices(n)) ]

The result is a list of lists containing:

[[0, 0, 0j],
 [0, 1, (-0.5    +0.5j)],
 [0, 2, (-0.8    +0.4j)],
 [0, 3, (-0.89999+0.3j)],
 [1, 1, ( 0.2    +0.6j)],
 [1, 2, (-0.25   +0.75j)],
 [1, 3, (-0.53846+0.69231j)],
 [2, 2, ( 0.15385+0.76923j)],
 [2, 3, (-0.16667+0.83333j)],
 [3, 3, ( 0.12   +0.84j)]]

(I also reduced the precision).

If you run into problems concerning available memory, then don't create any temporary array, but run instead:

result = [ [i, j, f(i,j)] for i, j in zip(*np.triu_indices(n)) ]

But this variant will run significantly slower than using np.fromfunction.

Valdi_Bo

You should be able to take advantage of numba's capabilities for this kind of work:

import numpy as np
from numba import jit, prange

n = 10000
@jit(nopython=True)
def f(i, j):
    return (i+1j*j)/(i-1j*j+1)  # a sample function, not necessarily of this form

def loops_lst(n):
    lst = []
    for i in range(n):
        for j in range(i, n):
            lst.append((i, j, f(i, j)))
    return np.asarray(lst)
res1 = loops_lst(n)

def loops_np(n): # solution by @Daniel F
    i, j = np.triu_indices(n)
    out = np.stack([i, j, f(i, j)])
    return out.T
res2 = loops_np(n)

@jit(nopython=True,parallel=True)
def loops_nb(n):
    dim = np.sum(np.arange(1, n+1))
    out_idx = np.empty((dim,2),dtype=np.int64)
    out_f = np.empty((dim),dtype=np.complex64)
    for i in prange(n):
        for j in range(i, n):
            idx = int((n*(n-1))/2 - ((n-i)*(n-i-1))/2+j)
            out_idx[idx,0] = i
            out_idx[idx,1] = j
            out_f[idx] = f(i, j)
    return out_idx, out_f

res3_ = loops_nb(n)
res3 = np.hstack((res3_[0], res3_[1][:,None]))

Here is the sanity check:

np.allclose(res1,res2)
>>> True
np.allclose(res1,res3)
>>> True

Here are the timings for n=10000:

# numpy triu_indices
%timeit res2 = loops_np(n)
2.32 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# numba
# parallel = False (and simple range instead of numba.prange)
%timeit res3_ = loops_nb(n);res3 = np.hstack((res3_[0], res3_[1][:,None]))
2.13 s ± 47.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# parallel = True
%timeit res3_ = loops_nb(n);res3 = np.hstack((res3_[0], res3_[1][:,None]))
1.46 s ± 71.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So if it is possible to njit your actual function f, you should definitely take a look at the numba solution.

Yacola