1

I have been a frequent lurker on Stack Overflow for some time and I tend to find very useful and clear information from here whenever I have coding questions. However, I can't really seem to find a thread that addresses my specific inquiry today.

Earlier today, I learned about vectorizing functions in Python in order to speed up computing time. I am currently trying to optimize a python program that I had written a little over a month ago. My program takes a text file containing data in the following format:

<magnitude> <dmagnitude> <exposure_number>

I then assign each column to lists mag, dmag, and expnum.

What I want to do is create a 2d array of the mag and dmag values that share the same expnum (having the same exposure number means that the mag and dmag point to the same data point).

I do this for all exposure numbers and, at the end, I take the median of the mag and dmag, and the standard deviation of the mag for each of the exposure number-based arrays and combine them all into one array that I can plot.

Currently, I have the following code:

from numpy import loadtxt,array,asarray,append,std,median,empty,take

data = loadtxt(infile,usecols=(0,1,2))
mag = data1[:,2].tolist() 
dmag = data1[:,3].tolist() 
expnum = data1[:,4].tolist() 

#initialize variables
indexing = list() 
master_mag = list() 
master_dmag = list() 
sub_mag = list() 
sub_dmag = list() 

mag_std = array([]) 
mag_stdmed = array([]) 
mag_med = array([])  

while len(mag) > 0: 
    num=expnum[0] 
    for i in range(0,len(expnum)): 
        if expnum[i] == num: 
            sub_mag.append(mag[i]) 
            sub_dmag.append(dmag[i]) 
            indexing.append(i) 

    #add the sub lists to their master lists
    master_mag.append(sub_mag) 
    master_dmag.append(sub_dmag) 
    sub_mag=list() 
    sub_dmag=list()

    #remove from mag, dmag, and expnum the index referred to by indexing
    while len(indexing) > 0:    
        mag.pop(indexing[-1]) 
        dmag.pop(indexing[-1]) 
        expnum.pop(indexing[-1]) 
        indexing.pop() 

#make the master mag and dmag lists into numpy arrays 
master_mag=asarray(master_mag) 
master_dmag=asarray(master_dmag) 

#generate the mag and dmag median and mag std arrays 
for i in range(0,len(master_mag)): 
    mag_std=append(mag_std,std(master_mag[i])) 
    mag_med=append(mag_med,median(master_mag[i])) 
    mag_stdmed=append(mag_stdmed,median(master_dmag[i])) 

#create empty numpy arrays to be used for mag med vs. mag std 
#and mag med vs. dmag med 
med_std=empty([0,2]) 
med_dmed=empty([0,2]) 

#fill in those arrays 
for i in range(0,len(mag_std)): 
    med_std=append(med_std,[[mag_med[i],mag_std[i]]],axis=0) 
    med_dmed=append(med_dmed,[[mag_med[i],mag_stdmed[i]]],axis=0) 

#sort the median mag and dmag standard deviation arrays by median mag 
order_med_std=med_std[:,0].argsort() 
order_med_dmed=med_dmed[:,0].argsort() 

sorted_med_std=take(med_std,order_med_std,0) 
sorted_med_dmed=take(med_dmed,order_med_dmed,0) 

And then I'm ready to plot sorted_med_dmed[:,0] vs. sorted_med_dmed[:,1] and sorted_med_std[:,0] vs. sorted_med_std[:,1]

This code works, it's just that I feel that it is too slow (especially when I get over 10,000 data points to work with). I want to try and vectorize this code to make it much quicker, but I have no idea where to begin.

I would like some help figuring out how to vectorize the matching-by-exposure-number component. I was thinking of creating a multi-dimensional array at the start that has the format: array([[[mag],[dmag]],...]) and a length equal to the number of different exposure numbers. Is there a way to generate and update an array like this in-line, without having to use a ton of loops?

Please let me know if you need any further clarity on what exactly this code is doing.

Thank you for your time.

Iggy
  • 33
  • 3
  • I'm afraid it's a hard to work out exactly what calculations you are trying to perform. If you were to write down in equation form med_dmed=..., then chances are that the appropriate numpy expression would be very similar. – mdurant Aug 12 '14 at 00:00
  • @Iggy I think you need to explain more what exactly you are trying to do. There is a better way of doing loop with the faster performance and you can optimize the speed of your code by using `multiprocessing`. You can look at [this example](http://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-python-multiprocessing/5550156#5550156) and figure out how you can re-write your code in this frame. – Dalek Aug 12 '14 at 12:19

1 Answers1

2

First step in any problem like this should always be profiling. I recommend trying line_profiler, because it makes it easy to find hotspots visually. (You can also try Python's built-in profiler, but I find its output harder to parse.)

That should give you an idea of which parts are slowing your code down the most. Without trying it myself, I can offer a few bits of advice:

  • Try to avoid using Python lists when numpy arrays will suffice. Lists are fast if you're doing lots of appends, but slow for most other operations, and they don't support vectorization.
  • Relatedly, try to avoid calling numpy.append if you can. Each call involves allocating more memory and copying, which can be quite slow in a loop.
  • Use dictionaries to group data by keys. I find the stdlib collections.defaultdict quite useful for grouping like this:

    groups = defaultdict(list)
    for a,b,key in data:
      groups[key].append((a,b))
    
  • Use numpy's auto-vectorized function calls instead of calling functions in a loop. For example, this code:

    #generate the mag and dmag median and mag std arrays 
    for i in range(0,len(master_mag)): 
      mag_std=append(mag_std,std(master_mag[i])) 
      mag_med=append(mag_med,median(master_mag[i])) 
      mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
    

    will be much faster when written as:

    mag_std = numpy.std(master_mag, axis=0)
    mag_meg = numpy.median(master_mag, axis=0)
    mag_stdmed = numpy.median(master_dmag, axis=0)
    
perimosocordiae
  • 17,287
  • 14
  • 60
  • 76