
I have a code that works perfectly well but I wish to speed up the time it takes to converge. A snippet of the code is shown below:

import time
import numpy as np
from numpy.linalg import norm

def myfunction(x, i):
    # Move x by the clipped residual min(0, target[i] - data[i] @ x),
    # scaled by data[i] / ||data[i]||**2
    y = x + min(0, target[i] - data[i, :] @ x) * data[i] / (norm(data[i])**2)
    return y

rows, columns = data.shape

start = time.time()
iterate = 0
iterate_count = []
norm_count = []
res = 5  # any value larger than the tolerance, so the loop runs at least once

x_not = np.ones(columns)  # initial guess
norm_count.append(norm(x_not))
iterate_count.append(0)

while res > 1e-8:
    for row in range(rows):
        y = myfunction(x_not, row)
        x_not = y
    iterate += 1
    iterate_count.append(iterate)
    norm_count.append(norm(x_not))
    res = abs(norm_count[-1] - norm_count[-2])

print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))

I am relatively new to Python. I would appreciate any hint/assistance.

Ax = b is the problem we wish to solve. Here, 'A' is 'data' and 'b' is 'target'.

MaliMali

1 Answer


Ugh! After spending a while on this, I don't think it can be fully vectorized the way you've set up your problem. In each iteration over the rows, you modify x_not and then pass the updated result on to the next row, and that kind of sequential dependence can't be vectorized easily. The failed attempt is still a good way to learn the thought process behind vectorization, so I'm including it in the answer. I'm also including a different iterative method for solving linear systems of equations, in a vectorized version (the solution is updated using matrix multiplication and vector addition) and a loopy version (the solution is updated element by element in a for loop), to demonstrate what kind of speedup you can expect.

1. The failed attempt

Let's take a look at what you're doing here.

def myfunction(x, i): 
    y = x + (min(0, target[i] - data[i, :] @ x)) * (data[i] / (norm(data[i])**2)) 
    return y
  • You subtract
    • the dot product of (the ith row of data and x_not)
    • from the ith element of target,
    • and clip the result so that anything positive becomes zero.
  • You multiply this result by the ith row of data divided by the squared norm of that row. Let's call the scaled row part2.
  • Then you add this product to x_not.

Now let's look at the shapes of the matrices.

  • data is (M, N).
  • target is (M, ).
  • x_not is (N, )

Instead of doing these operations rowwise, you can operate on the entire matrix!

1.1. Simplifying the dot product.

Instead of doing data[i, :] @ x, you can do data @ x_not and this gives an array with the ith element giving the dot product of the ith row with x_not. So now we have data @ x_not with shape (M, )

Then, you can subtract this from the entire target array, so target - (data @ x_not) has shape (M, ).

So far, we have

part1 = target - (data @ x_not)

Next, if anything is greater than zero, set it to zero.

part1[part1 > 0] = 0
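
(As a side note, np.minimum can do the subtraction's clipping in one call; this is just a stylistic alternative to computing part1 and then clipping it:)

part1 = np.minimum(0, target - (data @ x_not))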

1.2. Finding rowwise norms.

Finally, you want to multiply this by each row of data and divide by the square of the L2 norm of that row. To get the norm of each row of a matrix, you do

rownorms = np.linalg.norm(data, axis=1)

This is an (M,) array, so we need to convert it to an (M, 1) array so that the division applies row by row; rownorms[:, None] does this. Then divide data by the square of this.

part2 = data / (rownorms[:, None]**2)

1.3. Add to x_not

Finally, we add the row-wise sum of part1 * part2 to the original x_not and return the result. One detail: part1 has shape (M,), so it needs an extra axis (part1[:, None]) for the multiplication to broadcast across the rows of part2.

result = x_not + (part1[:, None] * part2).sum(axis=0)

Here's where we get stuck. In your approach, each call to myfunction() uses the x_not that was updated by the previous call, so the row updates are inherently sequential. The vectorized version above instead applies every row's correction to the same starting x_not, so it does not compute the same thing.
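
For completeness, here is what the assembled one-shot update would look like (just a sketch; the name step_all_rows is made up for this demo, and, as explained above, it is not equivalent to your sequential loop):

import numpy as np

def step_all_rows(x_not, data, target):
    # Clipped residuals for every row at once (positive entries become zero)
    part1 = np.minimum(0, target - data @ x_not)
    # Each row of data scaled by the inverse of its squared L2 norm
    part2 = data / np.linalg.norm(data, axis=1)[:, None]**2
    # Apply all row corrections to the same starting x_not (simultaneously, not one after another)
    return x_not + (part1[:, None] * part2).sum(axis=0)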

2. Why vectorize?

Using numpy's built-in methods instead of Python-level loops offloads the calculation to its C backend, so it runs faster. And if your numpy is linked against a BLAS library, matrix operations can also use your processor's SIMD registers for even more speed.
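
As a tiny illustration (the function dot_loopy below is made up for this demo), here is the same dot product written as an interpreted loop and as a single call into compiled code:

import numpy as np

def dot_loopy(a, b):
    # Every multiply and add here runs in the Python interpreter
    total = 0.0
    for ai, bi in zip(a, b):
        total += ai * bi
    return total

a = np.random.random(1_000_000)
b = np.random.random(1_000_000)
# dot_loopy(a, b) and a @ b compute the same value, but the @ version
# runs in NumPy's compiled code and is typically orders of magnitude faster.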

The conjugate gradient (CG) method is a simple iterative method for solving systems of equations whose matrix is symmetric positive definite. There are other, more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.

Given your system

data @ x_not = target

Let's define some variables. CG needs a symmetric positive (semi-)definite matrix, and data generally isn't one, so we work with the normal equations instead:

A = data.T @ data
b = data.T @ target

And we'll solve the system A @ x = b, whose solution is the least-squares solution of the original system.

tolerance = 1e-8                 # stop once every entry of the residual is below this
x = np.zeros((columns,))         # initial guess; can be anything
resid = b - A @ x                # residual of the current guess
p = resid                        # first search direction

while (np.abs(resid) > tolerance).any():
    Ap = A @ p
    alpha = (resid.T @ resid) / (p.T @ Ap)                # step length along p
    x = x + alpha * p
    resid_new = resid - alpha * Ap                        # updated residual
    beta = (resid_new.T @ resid_new) / (resid.T @ resid)  # correction for the next direction
    p = resid_new + beta * p                              # next (conjugate) search direction
    resid = resid_new + 0                                 # '+ 0' makes a copy instead of an alias
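
As an aside, if you just want a ready-made CG solver rather than rolling your own, SciPy ships one (this snippet assumes SciPy is installed and reuses the A and b defined above):

from scipy.sparse.linalg import cg

x_scipy, info = cg(A, b)   # info == 0 means the iteration converged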

To contrast the fully vectorized approach with one that uses a loop to update the individual elements of x and resid_new, let's define another implementation of the CG solver that does this.

def solve_loopy(data, target, itermax = 100, tolerance = 1e-8):
    A = data.T @ data
    b = data.T @ target
    
    rows, columns = data.shape
        
    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        for i in range(len(x)):
            x[i] = x[i] + alpha * p[i]
            resid_new[i] = resid[i] - alpha * Ap[i] 
        # resid_new = resid - alpha * A @ p
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    
    return x

And our original, fully vectorized method:

def solve_vect(data, target, itermax = 100, tolerance = 1e-8):
    
    A = data.T @ data
    b = data.T @ target
    
    rows, columns = data.shape

    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        x = x + alpha * p
        resid_new = resid - alpha * Ap
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    
    return x

Let's solve a simple system to see if this works first:

2x1 + x2 = -5
-x1 + x2 = -2

should give a solution of [-1, -3]

data = np.array([[ 2,  1],
       [-1,  1]])
target = np.array([-5, -2])
print(solve_loopy(data, target))
print(solve_vect(data, target))

Both give the correct solution [-1, -3], yay! Now on to bigger things:

data = np.random.random((100, 100))
target = np.random.random((100, ))

Let's ensure the solution is still correct:

sol1 = solve_loopy(data, target)
np.allclose(data @ sol1, target)
# Output: False

sol2 = solve_vect(data, target)
np.allclose(data @ sol2, target)
# Output: False

Hmm, looks like the CG method doesn't converge within itermax for the badly conditioned random matrices we created (forming the normal equations squares the condition number, which doesn't help). Well, at least both give the same result.

np.allclose(sol1, sol2)
# Output: True
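
If you do want a test problem where CG converges, one option (a sketch, reusing the solvers defined above) is to make data diagonally dominant, which keeps the normal equations reasonably well conditioned:

data = np.random.random((100, 100)) + 100 * np.eye(100)   # strongly diagonally dominant
target = np.random.random((100,))
sol = solve_vect(data, target)
# np.allclose(data @ sol, target) should now be True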

But let's not get discouraged! We don't really care if it works perfectly; the point of this is to demonstrate how amazing vectorization is. So let's time this:

import timeit

timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784

timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722

Nice! A ~2x speedup simply by avoiding a loop while updating our solution!

For larger systems, the gain is even bigger, since the Python-level loop overhead grows with N while the vectorized update stays in compiled code.

for N in [10, 50, 100, 500, 1000]:
    data = np.random.random((N, N))
    target = np.random.random((N, ))

    t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
    t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
    print(N, t_loopy, t_vect, t_loopy/t_vect)

This gives us:

    N     t_loopy      t_vect     speedup
00010    0.002823    0.002099    1.345390
00050    0.051209    0.014486    3.535048
00100    0.260348    0.114601    2.271773
00500    0.980453    0.240151    4.082644
01000    1.769959    0.508197    3.482822
Pranav Hosangadi
  • Thank you so much. I should try the loop and vectorize approach on GMRES to appreciate this. I appreciate your time. – MaliMali Oct 15 '20 at 23:09