
I am doing a tracking experiment:

Consider a trace of positions (of a reflecting gold bead, for example) in a field of view, measured at consecutive time points on a camera.

Now I'd like to write my own code for handling the data... My input is in the form of a numpy structured array and I take what I need from it.

Basically I start with a list with entries in the form of (position, time) for each point.

I iterate through all points and calculate the distance in space and in time from each point in the data set to every point that follows it in the measurement.

Depending on the amount of data, I end up with many hundreds of thousands of (distance_space, distance_time) tuples. I need these to calculate so-called MSD curves (MSD: mean squared displacement), where I plot the distance of e.g. the gold bead from its origin in space against the time that has passed.
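
For illustration, the pairing step for a single trace looks roughly like this (made-up 2D positions, simplified compared to my real code):

# Made-up example points of one trace: (position, time), position = (x, y)
points = [((0.0, 0.0), 0), ((0.5, 0.0), 1), ((1.0, 0.0), 2)]

pairs = []  # collects (distance_space, distance_time) tuples
for idx, (pos_a, t_a) in enumerate(points):
    for pos_b, t_b in points[idx:]:
        d_space = ((pos_b[0] - pos_a[0])**2 + (pos_b[1] - pos_a[1])**2) ** 0.5
        d_time = t_b - t_a
        pairs.append((d_space, d_time))

print(pairs)
# [(0.0, 0), (0.5, 1), (1.0, 2), (0.0, 0), (0.5, 1), (0.0, 0)]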

Now I don't measure only 1 trace but, let's say, 200... so I also want to differentiate between the separate traces, which opens up one more dimension.

I am able to extract the calculated data points into a HUGE list

MSD_list

with e.g. > 100,000,000 entries, each entry being a 3-tuple:

[(distance_space, distance_time, traceID),...,...]

That already takes some time but is still doable... I create a numpy array and store that list in it (takes ~20 seconds). A working example is below, with the list containing 3 made-up tuples for each of 2 separate traces, so a very small amount of data...

import numpy as np

MSD_list = [(0,0,0),(0.5,1,0),(1,2,0),(0,0,1),(2,1,1),(4,2,1)]
MSD_traces = [[] for x in range(2)]

#   MSD_traces is a list that collects the data later on in a nested loop.

MSD_array = np.zeros(len(MSD_list),dtype = [('d_space', 'f8'),('d_time', 'i4'), ('traceID', 'i4')])
MSD_array[:] = MSD_list

#   Now I want to go into the array, select all the data that has a common traceID,
#   then I select a subarray with same timelag and go into a second inner loop...

for i in range(2): #2 = the amount of unique traces in my example

    for k in range(np.max(MSD_array[MSD_array['traceID']==i]['d_time'])+1): 
        #iterating from d_time 0 to the maximum d_time in that trace
        
        if MSD_array[(MSD_array['traceID']==i) & (MSD_array['d_time']==k)]['d_space'].size: 
            #checking for empty array where np.mean() yields error
            
            msd_mean = np.mean(MSD_array[(MSD_array['traceID']==i) & (MSD_array['d_time']==k)]['d_space'])
            MSD_traces[i].append((msd_mean, k))

print(MSD_traces[0])
print(MSD_traces[1])

Output:

[(0.0, 0), (0.5, 1), (1.0, 2)]
[(0.0, 0), (2.0, 1), (4.0, 2)]

The code works fine. It just takes ages. I bet I use lists and numpy arrays sub-optimally.

In the example I end up with a list that contains both traces, MSD_traces[0] and MSD_traces[1], each containing the mean distance and the corresponding time lag. In general, the lengths of the traces are not the same.

I already tried creating numpy arrays in the loop for each trace instead of using one that contains ALL traces... pretty much the same performance.

I can't figure out how I could avoid the loops entirely and work with built-in numpy functions, or handle the data in large arrays in smarter, less complicated ways to speed things up.
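
For example, is something along these lines with np.bincount the kind of approach that could work? A rough sketch of what I have in mind, reusing MSD_array from the example above (I am not sure this is the right direction):

import numpy as np

# Rough sketch: group d_space by (traceID, d_time) without Python loops,
# by turning each (traceID, d_time) pair into one flat bin for np.bincount.
d_space = MSD_array['d_space']
d_time = MSD_array['d_time'].astype(np.int64)
trace_id = MSD_array['traceID'].astype(np.int64)

n_lags = d_time.max() + 1      # number of possible time lags
n_traces = trace_id.max() + 1  # number of traces
bins = trace_id * n_lags + d_time

sums = np.bincount(bins, weights=d_space, minlength=n_traces * n_lags)
counts = np.bincount(bins, minlength=n_traces * n_lags)

# Mean d_space per (traceID, d_time); NaN where a lag never occurs in a trace.
means = np.full(n_traces * n_lags, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)
means = means.reshape(n_traces, n_lags)
# means[i, k] is then the mean d_space of trace i at time lag k.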

Any tips?

Best, hubie

//edit: slightly modified version of the code above

I edited the code: inside the loops I now store the sub-arrays in variables to "save that view" of the array. I also tried to speed things up by deleting variables which I no longer need (does overwriting them make a difference?):

import numpy as np

MSD_list = [(0,0,0),(0.5,1,0),(1,2,0),(0,0,1),(2,1,1),(4,2,1)]
MSD_traces = [[] for x in range(2)]

#   MSD_traces is a list that collects the data later on in a nested loop.

MSD_array = np.zeros(len(MSD_list),dtype = [('d_space', 'f8'),('d_time', 'i4'), ('traceID', 'i4')])
MSD_array[:] = MSD_list

del MSD_list

#   Now I want to go into the array, select all the data that has a common traceID,
#   then I select a subarray with same timelag and go into a second inner loop...

for i in range(2): #2 = the amount of unique traces in my example
    
    MSD_array_ctID = MSD_array[MSD_array['traceID']==i]
    max_d_time = np.max(MSD_array_ctID['d_time'])
    
    for k in range(max_d_time+1): 
        #iterating from d_time 0 to the maximum d_time in that trace
        MSD_array_ctID_ctime = MSD_array_ctID[MSD_array_ctID['d_time']==k]
        size = MSD_array_ctID_ctime['d_space'].size
        
        if size: 
            #checking for empty array where np.mean() yields error

            msd_mean = np.mean(MSD_array_ctID_ctime['d_space'])
            MSD_traces[i].append((msd_mean, k))
            del msd_mean
        
        del size
        del MSD_array_ctID_ctime
    
    del max_d_time
    del MSD_array_ctID

Using timeit(), the time for the first version is 8.65 ms and for the second 8.62 ms ... not a huge difference :( I timed it without printing the lists at the end.

  • `MSD_array[(MSD_array['tid']==i) & (MSD_array['tlg']==k)]['msd']` -- you calculate this the first time, just grab the size, and throw the whole array away, only to calculate it again in the next statement. That's probably a significant waste. And not the only case in your algorithm. Shoving all the data into one giant array, and then having to repeatedly search it all for small amounts of records is probably not the best idea. I'd start with one array per trace id (this can be split up in a single pass). – Dan Mašek Oct 29 '19 at 16:37
  • And since you want it grouped by lag time as well, also an array per lag time. Also single pass. Once you've got those arrays, just one linear pass over all of them calculating the means. – Dan Mašek Oct 29 '19 at 16:44
  • As far as I can see, I only calculate the size once per loop for each time lag (d_time - sorry, I changed the variable names in an edit). You mean I should save it into a variable, use that variable in the if statement, and delete it after I am through the statement? – hubie Oct 29 '19 at 17:11
  • I don't mean the size calculation itself, it's the indexing operation that gives you a view of the original array with only the given trace ID and d_time. That has to search through the entire array. Save the result of that, and use that saved view in the calculation of size and mean. | Similarly, you could save the result of `MSD_array[MSD_array['traceID']==i]`, so that you have a view with only that particular trace. – Dan Mašek Oct 29 '19 at 17:25
  • @Dan Mašek Grouping by traceID will yield an array of lists with tuples of (d_space, d_time), but accessing the list elements will result in the same nested loop, won't it? If I pass the list corresponding to one trace just after creation to a function that calculates the norm of that trace, then it will do so in one loop, but I am still calling the function from another loop, so there are still two loops involved. I guess I am not seeing the big picture :-/ - I will look into it, thanks for your quick response – hubie Oct 29 '19 at 17:26
  • Like this (quick draft): https://pastebin.com/3erHQkAU – Dan Mašek Oct 29 '19 at 17:28
  • 1
    @hubie You'll probably have to time it with many more data points than the 6 in your example to see the difference. Try generating some pseudo-data using np.random.randint to make millions of rows. – Fiver Oct 29 '19 at 17:46
  • Yes, you have to do the performance measurements with a significant amount of data, otherwise the difference will be minimal (it's the large amount of data that matters, searching a 6 row array is trivial). – Dan Mašek Oct 29 '19 at 17:49
  • Why are you converting lists of tuples to numpy? – Mad Physicist Oct 29 '19 at 18:20
  • 1
    I couldn't believe it when I saw it... I simulated starting with 100 traces with 100k points distributed randomly. The code saving the pointer in the variable was ~50 times faster. With 10 Million points distributed over a 100 traces ... It is still calculating the first trace while I am writing this, when the optimized code went through it in about a Minute. Really astounding. @Mad Physicist I was under the impression, that working with a numpy array I can just better "grab" what I need... I am really not sure about the performance of what I am doing there... would you just use lists? – hubie Oct 29 '19 at 19:26
  • @hubie. I would preallocate the array if you could, and write to it directly – Mad Physicist Oct 29 '19 at 19:55
  • How does that work when I don't know its size before I have searched the data in the loops for their respective traceIDs? The problem, as far as I understand it, is that np.arrays work great with data of a fixed size, don't they? Or could you be more precise: *which* list/array should I preallocate in the code? – hubie Oct 29 '19 at 20:07
  • 1
    @hubie I would also suggest learning [how to use a profiler](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) to track down code that is taking more time in order to target areas to optimize. The pro edition of PyCharm has nice profiler integration. – Fiver Oct 29 '19 at 21:28
  • @hubie "Really astounding" -- it's a matter of reducing the amount of work. With 100 traces, we cut down the number of array entries the inner loop has to search 100 fold, and we reduced the number of searches per iteration from 2 to 1. As the number of iterations the inner loop has to do increases (as well as the number of traces), the savings rise significantly. Still, this isn't yet ideal, since you can do this in linear time. Even with pure Python's interprerter overhead, for large number of items this wins (100 traces,100k points,1000 distances in 5 sec) . https://pastebin.com/ZFqcrrCS – Dan Mašek Oct 30 '19 at 00:35
  • @hubie Slightly better: https://pastebin.com/TiZcqACA -- and that's just using the list of tuples; for some reason the structured numpy arrays perform much worse here. | Compiling this function with Cython, even with minimal annotations, yields another 2x speedup. | Still, writing this directly in some compiled language should yield further (significant) improvements and lend itself to parallelization to take advantage of current multi-core CPUs. – Dan Mašek Oct 30 '19 at 01:25
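
For reference, a rough guess at the kind of single-pass grouping suggested in the comments above (the linked pastebins are not reproduced here; the function below is a made-up illustration, not the commenter's actual code):

from collections import defaultdict

# Rough sketch: bucket the (d_space, d_time, traceID) tuples in one pass,
# then compute the mean per time lag for each trace in a second linear pass.
def msd_per_trace(msd_list):
    buckets = defaultdict(lambda: defaultdict(list))
    for d_space, d_time, trace_id in msd_list:
        buckets[trace_id][d_time].append(d_space)

    traces = {}
    for trace_id, by_lag in buckets.items():
        traces[trace_id] = [(sum(values) / len(values), lag)
                            for lag, values in sorted(by_lag.items())]
    return traces

MSD_list = [(0, 0, 0), (0.5, 1, 0), (1, 2, 0), (0, 0, 1), (2, 1, 1), (4, 2, 1)]
print(msd_per_trace(MSD_list))
# {0: [(0.0, 0), (0.5, 1), (1.0, 2)], 1: [(0.0, 0), (2.0, 1), (4.0, 2)]}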
