I have been a frequent lurker on Stack Overflow for some time, and I tend to find very useful and clear information here whenever I have coding questions. However, I can't seem to find a thread that addresses my specific question today.
Earlier today, I learned about vectorizing functions in Python in order to speed up computing time. I am currently trying to optimize a Python program that I wrote a little over a month ago. My program takes a text file containing data in the following format:
<magnitude> <dmagnitude> <exposure_number>
I then assign each column to the lists mag, dmag, and expnum.
What I want to do is create a 2D array of the mag and dmag values that share the same expnum (having the same exposure number means that the mag and dmag point to the same data point).
I do this for all exposure numbers and, at the end, I take the median of the mag and dmag, and the standard deviation of the mag, for each of the exposure-number-based arrays and combine them all into one array that I can plot.
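To make the goal concrete, here is a tiny example of the result I'm after. The sample values are made up, and the dictionary-based grouping is only there to illustrate the expected output; it is not the code I'm actually running.

```python
import numpy as np

# Made-up sample rows: magnitude, dmagnitude, exposure number
data = np.array([
    [15.0, 0.10, 1],
    [15.2, 0.20, 1],
    [15.4, 0.30, 1],
    [17.0, 0.05, 2],
    [17.2, 0.15, 2],
])
mag, dmag, expnum = data[:, 0], data[:, 1], data[:, 2]

# Group the (mag, dmag) pairs by exposure number
groups = {}
for m, d, e in zip(mag, dmag, expnum):
    groups.setdefault(e, []).append((m, d))

# One row per exposure number: median mag, std of mag, median dmag
rows = []
for e in sorted(groups):
    pairs = np.array(groups[e])
    rows.append([np.median(pairs[:, 0]),
                 np.std(pairs[:, 0]),
                 np.median(pairs[:, 1])])
summary = np.array(rows)
print(summary)
```

Each row of summary corresponds to one exposure number, so the columns are what I eventually want to plot against each other.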
Currently, I have the following code:
from numpy import loadtxt,array,asarray,append,std,median,empty,take
data = loadtxt(infile,usecols=(0,1,2))
#columns 0, 1, and 2 of data hold mag, dmag, and expnum
mag = data[:,0].tolist()
dmag = data[:,1].tolist()
expnum = data[:,2].tolist()
#initialize variables
indexing = list()
master_mag = list()
master_dmag = list()
sub_mag = list()
sub_dmag = list()
mag_std = array([])
mag_stdmed = array([])
mag_med = array([])
while len(mag) > 0:
    num=expnum[0]
    for i in range(0,len(expnum)):
        if expnum[i] == num:
            sub_mag.append(mag[i])
            sub_dmag.append(dmag[i])
            indexing.append(i)
    #add the sub lists to their master lists
    master_mag.append(sub_mag)
    master_dmag.append(sub_dmag)
    sub_mag=list()
    sub_dmag=list()
    #remove from mag, dmag, and expnum the index referred to by indexing
    while len(indexing) > 0:
        mag.pop(indexing[-1])
        dmag.pop(indexing[-1])
        expnum.pop(indexing[-1])
        indexing.pop()
#make the master mag and dmag lists into numpy arrays
master_mag=asarray(master_mag)
master_dmag=asarray(master_dmag)
#generate the mag and dmag median and mag std arrays
for i in range(0,len(master_mag)):
    mag_std=append(mag_std,std(master_mag[i]))
    mag_med=append(mag_med,median(master_mag[i]))
    mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
#create empty numpy arrays to be used for mag med vs. mag std
#and mag med vs. dmag med
med_std=empty([0,2])
med_dmed=empty([0,2])
#fill in those arrays
for i in range(0,len(mag_std)):
    med_std=append(med_std,[[mag_med[i],mag_std[i]]],axis=0)
    med_dmed=append(med_dmed,[[mag_med[i],mag_stdmed[i]]],axis=0)
#sort the median mag and dmag standard deviation arrays by median mag
order_med_std=med_std[:,0].argsort()
order_med_dmed=med_dmed[:,0].argsort()
sorted_med_std=take(med_std,order_med_std,0)
sorted_med_dmed=take(med_dmed,order_med_dmed,0)
And then I'm ready to plot sorted_med_dmed[:,0] vs. sorted_med_dmed[:,1] and sorted_med_std[:,0] vs. sorted_med_std[:,1].
This code works, but I feel that it is too slow (especially when I get over 10,000 data points to work with). I want to vectorize this code to make it much quicker, but I'm not sure where to begin.
I would like some help figuring out how to vectorize the matching-by-exposure-number component. I was thinking of creating a multi-dimensional array at the start that has the format array([[[mag],[dmag]],...]) and a length equal to the number of different exposure numbers. Is there a way to generate and update an array like this in-line, without having to use a ton of loops?
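The closest thing I can picture is the sketch below, which sorts the rows by exposure number and then splits them where the exposure number changes. The sample values are made up, and I have not checked whether this is the right approach or how it scales. Because each exposure number can have a different number of points, it produces a list of arrays rather than one rectangular array.

```python
import numpy as np

# Made-up sample columns standing in for the real data
mag = np.array([15.0, 17.0, 15.2, 17.2, 15.4])
dmag = np.array([0.10, 0.05, 0.20, 0.15, 0.30])
expnum = np.array([1, 2, 1, 2, 1])

# Sort the rows so that equal exposure numbers sit next to each other
order = np.argsort(expnum, kind="stable")
mag_s, dmag_s, exp_s = mag[order], dmag[order], expnum[order]

# Index of the first row of each exposure number in the sorted data
uniq, starts = np.unique(exp_s, return_index=True)

# Split at those boundaries (the leading 0 is skipped)
mag_groups = np.split(mag_s, starts[1:])
dmag_groups = np.split(dmag_s, starts[1:])

# Per-exposure statistics; this loops over groups, not over data points
mag_med = np.array([np.median(g) for g in mag_groups])
mag_std = np.array([np.std(g) for g in mag_groups])
dmag_med = np.array([np.median(g) for g in dmag_groups])

# Sort by median magnitude, ready for plotting
plot_order = np.argsort(mag_med)
med_std = np.column_stack((mag_med, mag_std))[plot_order]
med_dmed = np.column_stack((mag_med, dmag_med))[plot_order]
```

Here med_std and med_dmed are meant to play the same role as sorted_med_std and sorted_med_dmed in my current code.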
Please let me know if you need any further clarification on what exactly this code is doing.
Thank you for your time.