
I am trying to do a double integral by first interpolating the data to make a surface. I am using numba to try and speed this process up, but it's just taking too long.

Here is my code, with the images needed to run it located here and here.

Wesley Bowman
  • How long is it taking now? What would be an acceptable outcome? – John Zwinck Jul 08 '13 at 14:39
  • Um, more than 30 seconds for the nested for loops. So I'm on (0,1) for 30 seconds, then (0,2) for 30, etc. And it's a 2000x2000 matrix, so it would take years to run. If it could run in even a couple of days, that would be amazing. Just looking for something shorter, really – Wesley Bowman Jul 08 '13 at 14:47
  • Each iteration takes 340 seconds on my Macbook Air 1.6 GHz i5 without Numba. – John Zwinck Jul 10 '13 at 13:34

1 Answer


Noting that your code has a quadruple-nested set of for loops, I focused on optimizing the inner pair. Here's the old code:

for i in xrange(K.shape[0]):
    for j in xrange(K.shape[1]):

        print(i,j)
        '''create an r vector '''
        r=(i*distX,j*distY,z)

        for x in xrange(img.shape[0]):
            for y in xrange(img.shape[1]):
                '''create a ksi vector, then calculate
                   its norm, and the dot product of r and ksi'''
                ksi=(x*distX,y*distY,z)
                ksiNorm=np.linalg.norm(ksi)
                ksiDotR=float(np.dot(ksi,r))

                '''calculate the integrand'''
                temp[x,y]=img[x,y]*np.exp(1j*k*ksiDotR/ksiNorm)

        '''interpolate so that we can do the integral and take the integral'''
        temp2=rbs(a,b,temp.real)
        K[i,j]=temp2.integral(0,n,0,m)

Since K and img are each about 2000x2000, the innermost statements need to be executed sixteen trillion times. This is simply not practical in pure Python, but we can shift the work into C and/or Fortran by vectorizing with NumPy. I did this one careful step at a time to make sure the results would match; here's what I ended up with:

'''create all r vectors'''
R = np.empty((K.shape[0], K.shape[1], 3))
R[:,:,0] = np.repeat(np.arange(K.shape[0]), K.shape[1]).reshape(K.shape) * distX
R[:,:,1] = np.arange(K.shape[1]) * distY
R[:,:,2] = z

'''create all ksi vectors'''
KSI = np.empty((img.shape[0], img.shape[1], 3))
KSI[:,:,0] = np.repeat(np.arange(img.shape[0]), img.shape[1]).reshape(img.shape) * distX
KSI[:,:,1] = np.arange(img.shape[1]) * distY
KSI[:,:,2] = z

# vectorized 2-norm; see http://stackoverflow.com/a/7741976/4323                                                    
KSInorm = np.sum(np.abs(KSI)**2,axis=-1)**(1./2)

# loop over entire K, which is same shape as img, rows first                                                        
# this loop populates K, one pixel at a time (so can be parallelized)                                               
for i in xrange(K.shape[0]):                                                                                    
    for j in xrange(K.shape[1]):                                                                                

        print(i, j)

        KSIdotR = np.dot(KSI, R[i,j])
        temp = img * np.exp(1j * k * KSIdotR / KSInorm)

        '''interpolate so that we can do the integral and take the integral'''
        temp2 = rbs(a, b, temp.real)
        K[i,j] = temp2.integral(0, n, 0, m)

The inner pair of loops is now completely gone, replaced by vectorized operations done in advance (at a space cost linear in the size of the inputs).
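To convince yourself the rewrite is lossless, it's easy to check the loop and the vectorized expression against each other on a small array. Here's a minimal sketch of that check; the sizes and constants (distX, distY, z, k) below are made up for illustration and have nothing to do with your real data:

import numpy as np

# made-up constants and a tiny stand-in for img, just for the comparison
distX, distY, z, k = 0.1, 0.1, 5.0, 2.0
img = np.random.rand(4, 5)
r = np.array([2 * distX, 3 * distY, z])   # one fixed r vector

# original inner loops, for this single r
temp_loop = np.empty(img.shape, dtype=complex)
for x in range(img.shape[0]):
    for y in range(img.shape[1]):
        ksi = (x * distX, y * distY, z)
        ksiNorm = np.linalg.norm(ksi)
        ksiDotR = float(np.dot(ksi, r))
        temp_loop[x, y] = img[x, y] * np.exp(1j * k * ksiDotR / ksiNorm)

# vectorized equivalent: build all ksi vectors once, then a single dot product
KSI = np.empty((img.shape[0], img.shape[1], 3))
KSI[:, :, 0] = np.arange(img.shape[0])[:, None] * distX
KSI[:, :, 1] = np.arange(img.shape[1]) * distY
KSI[:, :, 2] = z
KSInorm = np.sum(np.abs(KSI)**2, axis=-1)**0.5
temp_vec = img * np.exp(1j * k * np.dot(KSI, r) / KSInorm)

print(np.allclose(temp_loop, temp_vec))   # expect True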

This reduces the time per iteration of the outer two loops from 340 seconds to 1.3 seconds on my Macbook Air 1.6 GHz i5, without using Numba. Of the 1.3 seconds per iteration, 0.68 seconds are spent in the rbs function, which is scipy.interpolate.RectBivariateSpline. There is probably room to optimize further--here are some ideas:

  1. Reenable Numba. I don't have it on my system. It may not make much difference at this point, but easy for you to test.
  2. Do more domain-specific optimization, such as trying to simplify the fundamental calculations being done. My optimizations are intended to be lossless, and I don't know your problem domain so I can't optimize as deeply as you may be able to.
  3. Try to vectorize the remaining loops. This may be tough unless you are willing to replace the scipy RBS function with something supporting multiple calculations per call.
  4. Get a faster CPU. Mine is pretty slow; you can probably get a speedup of at least 2x simply by using a better computer than my tiny laptop.
  5. Downsample your data. Your test images are 2000x2000 pixels, but contain fairly little detail. If you cut their linear dimensions by 2-10x, you'd get a huge speedup (a rough sketch follows below).
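For idea 5, here's a rough sketch of what downsampling could look like; the array and grid spacings are placeholders, and it assumes some loss of resolution is acceptable. Because the total work scales with the product of the sizes of K and img, cutting each linear dimension by a factor of 4 cuts the work by roughly 256x:

import numpy as np
# from scipy.ndimage import zoom   # alternative: interpolated resampling

factor = 4
distX, distY = 0.1, 0.1            # placeholders for the spacings in your code
img = np.random.rand(2000, 2000)   # placeholder for one of your test images

img_small = img[::factor, ::factor]     # simple decimation: 500x500
# img_small = zoom(img, 1.0 / factor)   # or smoother, interpolated resampling

# scale the grid spacing so the physical coordinates stay the same
distX_small = distX * factor
distY_small = distY * factor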

So that's it for me for now. Where does this leave you? Assuming a slightly better computer and no further optimization work, even the optimized code would take about a month to process your test images. If you only have to do this once, maybe that's fine. If you need to do it more often, or need to iterate on the code as you try different things, you probably need to keep optimizing--starting with that RBS function which consumes more than half the time now.

Bonus tip: your code would be a lot easier to deal with if it didn't have nearly identical variable names like k and K, or use j both as a loop variable and as the complex-number suffix (as in 1j).

John Zwinck
  • Thanks! I changed some names to make it less confusing. I usually do that (and add comments) but I was just so frustrated. Numba doesn't work with this, getting a weird JIT error, but it probably wouldn't speed it up much anyway. Numexpr might, and I should be able to easily parallelize the for loop since each pixel is independent. Unfortunately, I can't lose any information from the photos; this is an attempt to numerically reconstruct a hologram using DIHM (digital in-line holographic microscopy). Can you think of a better way to do a double integral than using RBS? That might reduce times. – Wesley Bowman Jul 13 '13 at 13:29
  • Parallelizing to deal with all the pixels should be easily done in a number of ways--I should have mentioned that. Even simply using the basic `multiprocessing` module should give you a speedup nearly linear in the number of cores you have. My machine doesn't have a lot of cores, but given a 12-way machine you could finish the whole job in 3 days or so without further optimization. Regarding RBS, I am no expert in this area, but it strikes me that you are spending a fair amount of time to fit a curve to your data, only to compute the integral. What if you instead do a Riemann sum directly? (a rough sketch of that follows these comments) – John Zwinck Jul 13 '13 at 13:37
  • I will actually be using mpi4py so that I can run it on a cluster with hundreds of cores. multiprocessing doesn't work with clusters (unfortunately). The Riemann sum directly is what we initially wanted to do, but we thought that creating a surface for the data would help with integration. A 2d Riemann sum just seems like it would take a lot of time. Also, thank you for all of your help. I have learned a lot from the code. – Wesley Bowman Jul 13 '13 at 13:45
  • You're welcome. It sounds like you're on the right track--just keep thinking about how to simplify and reduce the number of operations--especially pure Python ones like loops. A 2D Riemann sum may take a bit of time, but right now half the total time is spent doing the alternative, so Riemann might be faster, especially if you're willing to accept some loss of precision (which may not be a loss after all considering your original approach integrates a fitted curve and not the real data anyway). – John Zwinck Jul 13 '13 at 14:07
  • I have actually now vectorized the dot product you have, bringing that outside of the for loop as well. Slowly working on getting rid of this for loop, just running into a lot of memory errors along the way. – Wesley Bowman Jul 16 '13 at 17:19
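Following up on the Riemann-sum suggestion in the comments above, here's a minimal sketch of what replacing the RectBivariateSpline fit-and-integrate step with a direct trapezoidal integral could look like. The array names and sizes are placeholders; it assumes a and b are the coordinate vectors the spline is currently built from, and that the integral runs over the full grid:

import numpy as np

a = np.linspace(0, 1, 200)              # placeholder row coordinates
b = np.linspace(0, 1, 300)              # placeholder column coordinates
temp = np.random.rand(a.size, b.size)   # placeholder for the real part of the integrand

# integrate along the columns first, then along the rows -- no spline fit needed
K_ij = np.trapz(np.trapz(temp, b, axis=1), a)

This trades the smoothing of the spline fit for plain quadrature over the sampled data, which may be acceptable (or even preferable) given that the spline was only being fitted in order to integrate it.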