Managing big point lists, best approach?

Question

I have hit a performance wall in my code and have decided to rewrite it, and need some advice on how to tackle this issue. I have a huge list of optical flow data that consists of lists with a frame, X and Y coordinates. like so:

[[[frame,x,y],[frame,x,y]],[[frame,x,y],[frame,x,y]]...]

I have uploaded a sample here: http://pastebin.com/ANUr8bwc

I need to find a way to manage this data so that I can do quick lookups and see what lists contain certain frames.

So far I have looped through all of the data to see what lists contain say frame 34 and 35 and then index them into a new list for reference.

thisFrame = 34
nextFrame = thisFrame + 1
if any(e[0] == thisFrame for e in item) and any(e[0] == nextFrame for e in item): #Check if the item contains this frame and next
    DoStuff()

Doing this a few thousand times for a list of 10.000+ points quickly turns into a bottleneck. So my idea was to make a dict for each frame and in that way easily be able to find what items are available on a certain frame:

{frame, [list1,list2,list3]}

But I think I better ask this time. Is there a good goto method for storing and being able to do lookups in big datasets, to avoid looping through all of them every time you need to do so?

`I have hit a performance wall` how did you come to that conclusion? — thefourtheye, Feb 03 '14 at 07:19
Do you have duplicate frames? If not, you would be better off not storing the `frame` at all, and derive the index based on the frame you want to request. — Attila O., Feb 03 '14 at 07:26
So, can i merge these two frames: [[110, '79.607002', '76.638'], [111, '61.595001', '85.915001']] AND [[110, '257.90201', '108.176'], [111, '232.58501', '108.98']] — James Sapam, Feb 03 '14 at 07:27
@yopy I think so, that would in fact avoid the `any()` lookups. — Attila O., Feb 03 '14 at 07:29

score 1 · Accepted Answer · edited May 23 '17 at 12:23

Dictionaries are one of Python's most heavily optimized data structures. Converting your data into a dictionary would take some time, but afterwards, lookups (e.g. thisFrame in item) are done in O(1) time, which are much faster than lists.

Your best bet would be to convert into a dictionary, using some code like this:

N.B. It looks like your lists are nested twice, you'll have to slightly modify the iteration if this isn't the case.

item_dict = {}
for big_lst in item:
    for lst in big_lst:
        try:
            item_dict[lst[0]] += [lst[1:],] # Append to existing value
        except KeyError:
            item_dict[lst[0]] = [lst[1:],] # Initialize value

Edit 02/05: try/except is faster

item_dict will look like this, with duplicate frames merged so that a single frame lookup will return a list of [x, y] pairs.

item_dict = {
    1: [list1, list2, list3]
    2: [list1, list2]
    3: [list1]
    4: [list1, list2, list3]
}

Lookups from then on will be very fast:

thisFrame = 34
nextFrame = thisFrame + 1
if thisFrame in item_dict and nextFrame in item_dict:
    foo = item_dict[thisFrame] # e.g. [list1, list2]
    bar = item_dict[nextFrame] # e.g. [list1, list2, list3]
    DoStuff()

If you need to keep track of what parent list the individual [x, y] pairs belong to, you could add one additional element of each list which stores the index of the parent list in item:

item_dict = {}
for list_index, big_lst in enumerate(item):
    for lst in big_lst:
        if lst[0] in item_dict:
            item_dict[lst[0]] += [lst[1:]+[list_index],] # Append
        else:
            item_dict[lst[0]] = [lst[1:]+[list_index],] # Initialize

Then, a lookup like

parent_list = item_dict[thisFrame][2] # [x, y, list_index]

will return the parent list which can be accessed:

item[parent_list]

his lookup is not the key, he is trying to find out the key from the matching value, thats what i understood, so its not O(1) — James Sapam, Feb 03 '14 at 07:35
@yopy the two `any` functions return a boolean value that checks whether `thisFrame` and `nextFrame` exist in the first index of any list in `item`. I believe the dictionary lookup towards the end of my answer does the same thing? — jayelm, Feb 03 '14 at 07:36
That was exactly what i was thinking. Having the index in the dict to refrence back to the main struc. Thanks for the help everyone! — user2339945, Feb 04 '14 at 20:10

score 1 · Answer 2 · answered Feb 03 '14 at 07:45

What I am trying to do here :

First i tried to merge all the x,y to the unique frame by creating a dictionary called frame. Second, i revert the dictionary by converting the value into key and key into value.

Please let me know if that works : else i ll modify or remove it.

#!/usr/bin/python

import ast

dict_frame = dict()
def generate_datastructure():
    l = ''
    with open('2d_optical_tracking_data.txt', 'r') as fh:
        l = ast.literal_eval(fh.read())
    frame = dict()
    for ls in l:
        for elm in ls:
            key = elm[0]
            val_list = elm[1:]
            frame.setdefault(key, [])
            frame[key].extend(val_list)

    # convert all the value into set:
    for key, val in frame.iteritems():
        dict_frame[tuple(set(val))] = key

def lookup(key):
    print dict_frame

if __name__ == '__main__':
    tofind = '45.835999'
    generate_datastructure()
    for key, val in dict_frame.iteritems():
        if tofind in key:
            print val

Managing big point lists, best approach?

2 Answers2