
I'm trying to replicate a Perl script in Python, and I am running into some major performance problems.

Essentially, I have a tuple of lists and process one list at a time. The list includes a TYPE (string), ID (string), OVERALL_COUNT (int), TYPE_ID_COUNT (int), and a DROP_KEEP_FLAG (string).

My goal is to read the data into a memory-efficient data structure that is fast to access. I want to trim the structure every time I process a new record, the trimming criterion being that no TYPE_ID_COUNT can be kept if it is more than 2 behind the current TYPE_ID_COUNT (i.e., remove if historical TYPE_ID_COUNT < current TYPE_ID_COUNT - 2).
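In isolation, that trimming rule can be sketched with a `collections.deque` (this is just an illustration of the criterion with made-up counts, not my actual code): because the counts arrive in increasing order, anything that falls out of the window is always at the front.

```python
from collections import deque

window = 2

# Keep only counts within `window` of the current count.
counts = deque()
for current in [1, 2, 5, 6, 7]:
    counts.append(current)
    # Drop anything more than `window` behind the current count.
    while counts and counts[0] < current - window:
        counts.popleft()

print(list(counts))  # -> [5, 6, 7]
```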

When I compare this to my Perl code, it is orders of magnitude slower. I have included 2 versions (in the second I replaced dict.keys() with dict). I'm relatively new to Python, so I'm sure there are more optimal ways to write this code. Would NumPy arrays be better?

import timeit
from collections import defaultdict
from copy import deepcopy

# FIELDS: TYPE, ID, OVERALL_COUNT, TYPE_ID_COUNT, DROP_KEEP_FLAG 

a = (['TYPE_1','000000001',1,1,'K'],['TYPE_2','000000002',2,1,'K'],['TYPE_3','000000001',3,1,'K'],
     ['TYPE_1','000000002',4,1,'K'],['TYPE_1','000000002',5,2,'K'],['TYPE_3','000000002',6,1,'K'],
     ['TYPE_1','000000002',7,3,'K'],['TYPE_1','000000002',8,4,'D'],['TYPE_1','000000002',9,5,'K'],  
     ['TYPE_1','000000001',10,2,'K'],['TYPE_2','000000001',11,1,'K'],['TYPE_2','000000001',12,2,'K'],
     ['TYPE_2','000000001',13,3,'K'],['TYPE_3','000000001',14,2,'K'],['TYPE_3','000000001',15,3,'K'],
     ['TYPE_3','000000002',16,2,'K'],['TYPE_3','000000002',17,3,'K'],['TYPE_3','000000002',18,4,'K'])

expand = a
for x in range(0, 250):
    expand = expand + a

window = 2

def version_1(data,window):
    result = defaultdict(lambda: defaultdict(list))
    output = {}

    for i_idx,i_val in enumerate(data,start=1):

        #ADD NEW ELEMENT IF IT IS A 'K'
        if i_val[4] == 'K':
            result[i_val[0]][i_val[1]].append(i_val[2])

        # TRIM OLD ELEMENTS AND COMPUTE LENGTHS
        for j_idx, j_key in enumerate(result.keys(),start=1):
            j_val = result.get(j_key)
            j_val_cp = deepcopy(j_val)

            output[j_key] = 0 

            for k_idx, k_key in enumerate(j_val_cp.keys(),start=1):
                k_val = j_val.get(k_key)

                for item in (x for x in k_val if x < i_val[2] - window):
                    k_val.remove(item)
                    if not k_val:
                        del j_val[k_key]

                    if k_key == i_val[1]:

                        output[j_key] = len(k_val)
                    #print('Output ' + str(i_idx) + ': ID: ' + i_val[1] + ' , values: ' + str(output.items()))
    return output    

def version_2(data,window):
    result = defaultdict(lambda: defaultdict(list))
    output = {}

    for i_idx,i_val in enumerate(data,start=1):

        #ADD NEW ELEMENT IF IT IS A 'K'
        if i_val[4] == 'K':
            result[i_val[0]][i_val[1]].append(i_val[2])

        # TRIM OLD ELEMENTS AND COMPUTE LENGTHS
        for j_idx, j_key in enumerate(result,start=1):
            j_val = result.get(j_key)
            j_val_cp = deepcopy(j_val)

            output[j_key] = 0 

            for k_idx, k_key in enumerate(j_val_cp,start=1):
                k_val = j_val.get(k_key)

                for item in (x for x in k_val if x < i_val[2] - window):
                    k_val.remove(item)
                    if not k_val:
                        del j_val[k_key]

                    if k_key == i_val[1]:

                        output[j_key] = len(k_val)
                    #print('Output ' + str(i_idx) + ': ID: ' + i_val[1] + ' , values: ' + str(output.items()))
    return output

# timeit.timeit(version_1(2))
start_time = timeit.default_timer()
version_1(expand,2)
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
version_2(expand,2)
print(timeit.default_timer() - start_time) 
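As an aside, the commented-out `timeit.timeit(version_1(2))` line above would not behave as intended: `timeit.timeit` expects a callable (or a statement string), not the result of a call. A minimal sketch of the intended usage, with `work` as a placeholder function:

```python
import timeit

def work():
    # Stand-in for something like version_1(expand, 2).
    return sum(range(1000))

# Pass a zero-argument callable; `number` controls how many times it runs.
elapsed = timeit.timeit(work, number=100)
print(elapsed)
```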

Any help would be greatly appreciated!

Brad
  • Instead of iterating through the dictionary keys, can you not iterate through the items themselves? Surely that would provide some benefit, rather than iterating the keys and then looking up the item with that key. Use `itervalues()` for just the values, or `iteritems()` for the key and value. – Tom Mar 16 '16 at 16:50
  • I suggest you profile your code and see where it's spending its time. See question [_How can you profile a Python script?_](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) It's a fairly easy thing to do and will likely tell you what part(s) to spend time optimizing. – martineau Mar 16 '16 at 17:39
  • I don't understand your criteria... can you expand on them? Are you saying you want to keep the last two TYPE_ID_COUNTs, or remove all that are < current value - 2? And how does that relate to the final value that is returned? You keep throwing away the data in the output at `output[j_key] = 0`, so I can't figure out what is supposed to be there. Why are you reprocessing everything on each value? Wouldn't you just build the lists and filter them at the end? – tdelaney Mar 16 '16 at 17:46
  • Your version_2 code returns `{'TYPE_1': 0, 'TYPE_2': 0, 'TYPE_3': 0}`, but I can't think of any interpretation of your criteria that would do that. Are you sure it's correct? – tdelaney Mar 16 '16 at 17:55
  • You have a nested dictionary `result[TYPE][ID]` but build a flat dictionary `output[TYPE]`. This line `output[j_key] = len(k_val)` keeps overwriting IDs you've already recorded in favor of the final one that happens to be processed. So why process the other ones at all? – tdelaney Mar 16 '16 at 18:39
  • @tdelaney - When this goes into production, I will be updating a model with every new line that comes in, so I will be using the output for every observation that comes in. – Brad Mar 16 '16 at 19:58
  • Hello? Are you interested in helping? Your description is incomplete and your algorithm seems broken, so it's hard to guess what you want. Is your rule applied across all counts read to date, or only to counts with the same TYPE and ID? You only trim preceding values, so with counts of `5, 5, 9, 10, 5, 5`, is the output expected to be `9, 10` or `9, 10, 5, 5`? – tdelaney Mar 16 '16 at 19:58
  • This is a simplified version of what I am actually doing. The real version will count events for each [TYPE][ID] within a time window and remove any events older than that window. Example: for the [EVENT][TYPE] that is currently being processed, what are the counts for the last hour? – Brad Mar 16 '16 at 20:02
  • What are you trying to obtain with this code? It is extremely slow and appends several copies of the same thing; is that your intention? `expand = a; for x in range(0, 250): expand = expand + a` – EnricoGiampieri Mar 17 '16 at 13:38
  • The `expand = a; for x in range(0, 250): expand = expand + a` was just to create more sample data to test with, so that I could see a larger differential in performance. – Brad Mar 17 '16 at 15:29

1 Answer


I struggled to understand your question, so I do not know what your desired output should be. Using lots of nested loops in Python is a bad idea. Have you tried using a look-up table to grab the items you want to filter against? For example:

from collections import defaultdict

# a is your 'historical' data
# Now create a look-up table to access elements of interest:
types = defaultdict(list)

for i, item in enumerate(a):
    t = item[0]
    types[t].append(i)

# You can now use your look-up table to grab items of interest, e.g.:
new_record = ['TYPE_3','000000002',100,2,'K']
current_id_count = new_record[2]
type_idxs = types[new_record[0]]

old_types_id_counts = [a[i][2] for i in type_idxs]

# Collect the indexes of the items you want to remove:
f = [type_idxs[i] for i in range(len(type_idxs)) if old_types_id_counts[i] < current_id_count - 2]

# Once you have your indexes of items to remove, you just need to remove those items and update your table.
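One way to finish that removal step (a sketch under my own assumptions, not part of the original answer — the data and flagged indexes here are hard-coded for illustration): delete in reverse index order so earlier positions are not shifted, then rebuild the look-up table.

```python
from collections import defaultdict

# Hypothetical data: a mutable copy of the history, plus the
# indexes flagged for removal by the filter above.
a = [['TYPE_3', '000000002', 6, 1, 'K'],
     ['TYPE_3', '000000002', 16, 2, 'K'],
     ['TYPE_3', '000000002', 17, 3, 'K']]
f = [0]  # indexes whose counts fell outside the window

# Delete flagged rows from the end backwards so earlier
# indexes stay valid during deletion.
for i in sorted(f, reverse=True):
    del a[i]

# Rebuild the look-up table, since positions in `a` have changed.
types = defaultdict(list)
for i, item in enumerate(a):
    types[item[0]].append(i)

print(len(a))  # -> 2
```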
kezzos