I am doing a tracking experiment:
Consider a trace of positions (of a reflecting gold bead, for example) in a field of view, measured at consecutive time points on a camera.
Now I'd like to write my own code for handling the data... My input is in the form of a numpy structured array and I take what I need from it.
Basically I start with a list with entries of the form (position, time) for each point.
I iterate through all points and calculate the distance in space and time from each point in the data set to every point that follows it in the measurement.
Depending on the amount of data, I end up with many hundreds of thousands of (distance_space, distance_time) tuples. I need these to calculate so-called MSD curves (MSD: mean squared distance), where I plot the distance of, e.g., the gold bead from its origin in space over the time that has passed.
Now I don't measure only 1 trace but, let's say, 200... so I also want to differentiate between the separate traces, which opens up one more dimension.
I am able to extract the calculated data points into a HUGE list MSD_list with e.g. > 100,000,000 entries, each entry being a tuple of 3:
[(distance_space, distance_time, traceID), ..., ...]
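Just to make the structure clear, here is a stripped-down sketch of how such a list gets built from per-trace (position, time) points (the traces variable, the 1D positions and the nested loops are only for illustration; my real input is the structured array mentioned above):

# made-up example input: one list of (position, time) points per trace
traces = [
    [(0.0, 0), (0.5, 1), (1.0, 2)],  # trace 0
    [(0.0, 0), (2.0, 1), (4.0, 2)],  # trace 1
]

MSD_list = []
for traceID, points in enumerate(traces):
    # compare every point to itself and to every point that follows it in the measurement;
    # the self-pair gives the (0, 0) entry at time lag 0
    for a in range(len(points)):
        for b in range(a, len(points)):
            pos_a, t_a = points[a]
            pos_b, t_b = points[b]
            distance_space = abs(pos_b - pos_a)  # 1D here; Euclidean distance in general
            distance_time = t_b - t_a
            MSD_list.append((distance_space, distance_time, traceID))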
That already takes some time but is still doable... I create a numpy array and store that list into it (takes ~20 seconds). A working example is below, with the list containing 3 made-up tuples for each of 2 separate traces, so a very small amount of data...
import numpy as np
MSD_list = [(0,0,0),(0.5,1,0),(1,2,0),(0,0,1),(2,1,1),(4,2,1)]
MSD_traces = [[] for x in range(2)]
# MSD_traces is a list that collects the data later on in a nested loop.
MSD_array = np.zeros(len(MSD_list),dtype = [('d_space', 'f8'),('d_time', 'i4'), ('traceID', 'i4')])
MSD_array[:] = MSD_list
# Now I want to go into the array, select all the data that has a common traceID,
# then I select a subarray with same timelag and go into a second inner loop...
for i in range(2): #2 = the amount of unique traces in my example
    for k in range(np.max(MSD_array[MSD_array['traceID']==i]['d_time'])+1):
        #iterating from d_time 0 to the maximum d_time in that trace
        if MSD_array[(MSD_array['traceID']==i) & (MSD_array['d_time']==k)]['d_space'].size:
            #checking for empty array where np.mean() yields error
            msd_mean = np.mean(MSD_array[(MSD_array['traceID']==i) & (MSD_array['d_time']==k)]['d_space'])
            MSD_traces[i].append((msd_mean, k))

print(MSD_traces[0])
print(MSD_traces[1])
Output:
[(0.0, 0), (0.5, 1), (1.0, 2)]
[(0.0, 0), (2.0, 1), (4.0, 2)]
The code works fine; it just takes ages. I bet I am using lists and numpy arrays sub-optimally.
In the example I end up with a list that contains both traces, MSD_traces[0] and MSD_traces[1], each containing the mean distance and the corresponding time lag. In general, the lengths of the traces are not the same.
I already tried creating numpy arrays in the loop for each trace instead of using one array that contains ALL traces... pretty much the same result performance-wise.
I can't figure out how I could avoid using a loop altogether and work with built-in numpy functions, or handle the data in large arrays in smarter, less complicated ways to speed things up.
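The only loop-free direction I could think of so far is a rough, untested sketch (n_lags, key, sums, counts etc. are names I just made up): collapse (traceID, d_time) into a single integer key and let np.bincount do the grouping:

# assumes d_time values are non-negative integers starting at 0, as in MSD_array above
n_lags = MSD_array['d_time'].max() + 1
# one integer key per (traceID, d_time) combination
key = MSD_array['traceID'].astype(np.int64) * n_lags + MSD_array['d_time']

sums = np.bincount(key, weights=MSD_array['d_space'])  # sum of d_space per (trace, lag)
counts = np.bincount(key)                               # number of entries per (trace, lag)

valid = counts > 0                                      # (trace, lag) combinations that actually occur
means = sums[valid] / counts[valid]                     # mean d_space per occupied (trace, lag)
trace_ids = np.nonzero(valid)[0] // n_lags              # which trace each mean belongs to
time_lags = np.nonzero(valid)[0] % n_lags               # which time lag each mean belongs to

I am not sure whether this is correct or actually faster on the real data, though.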
Any tips?
Best, hubie
//edit: the code from above, slightly modified
I edited the code and stored the arrays from the loop in variables to "save that view" of the array; I also tried to speed things up by deleting variables I no longer need (does overwriting them make a difference?):
import numpy as np
MSD_list = [(0,0,0),(0.5,1,0),(1,2,0),(0,0,1),(2,1,1),(4,2,1)]
MSD_traces = [[] for x in range(2)]
# MSD_traces is a list that collects the data later on in a nested loop.
MSD_array = np.zeros(len(MSD_list),dtype = [('d_space', 'f8'),('d_time', 'i4'), ('traceID', 'i4')])
MSD_array[:] = MSD_list
del MSD_list
# Now I want to go into the array, select all the data that has a common traceID,
# then I select a subarray with same timelag and go into a second inner loop...
for i in range(2): #2 = the amount of unique traces in my example
    MSD_array_ctID = MSD_array[MSD_array['traceID']==i]
    max_d_time = np.max(MSD_array_ctID['d_time'])
    for k in range(max_d_time+1):
        #iterating from d_time 0 to the maximum d_time in that trace
        MSD_array_ctID_ctime = MSD_array_ctID[MSD_array_ctID['d_time']==k]
        size = MSD_array_ctID_ctime['d_space'].size
        if size:
            #checking for empty array where np.mean() yields error
            msd_mean = np.mean(MSD_array_ctID_ctime['d_space'])
            MSD_traces[i].append((msd_mean, k))
            del msd_mean
        del size
        del MSD_array_ctID_ctime
    del max_d_time
    del MSD_array_ctID
Using timeit(), the time for the first code is 8.65 ms, the time for the second 8.62 ms... not a huge difference :( I timed it without the printing of the lists at the end.
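For completeness, the timing was done roughly like this (simplified; run_msd_loop is just a made-up wrapper around the loop from above, without the prints):

import timeit
import numpy as np

MSD_list = [(0,0,0),(0.5,1,0),(1,2,0),(0,0,1),(2,1,1),(4,2,1)]
MSD_array = np.zeros(len(MSD_list), dtype=[('d_space', 'f8'), ('d_time', 'i4'), ('traceID', 'i4')])
MSD_array[:] = MSD_list

def run_msd_loop():
    MSD_traces = [[] for _ in range(2)]
    for i in range(2):
        trace = MSD_array[MSD_array['traceID'] == i]
        for k in range(np.max(trace['d_time']) + 1):
            d_space = trace[trace['d_time'] == k]['d_space']
            if d_space.size:
                MSD_traces[i].append((np.mean(d_space), k))
    return MSD_traces

# average time per run over 100 repetitions, without any printing
print(timeit.timeit(run_msd_loop, number=100) / 100)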