2

I have a list of Item objects that have a date attribute. I also have a a single date I am grabbing from the database.

I want to search through the list, find all of the list items that are greater than the date I have returned from the database.

My list of Items objects has over a thousand objects in it, so I want to be as efficient as possible.

I assume that looping over every item in my list and checking if it is greater than the date I returned from the db is not the most efficient way to do this.

class Item(object):    
    def __init__(self, title, link, description, date):
        self.title = title
        self.link = link
        self.description = description
        self.date = date

item_list = [] 
...
#assume I populate the list with a 1,000 Item objects

filtered_list = []
for it in item_list:
    if date_from_db > it.date:
        filtered_list.append(it)
imns
  • 4,996
  • 11
  • 57
  • 80
  • 1
    you may use a list sorted by dates and look the first item greater then the date of query item. – crodjer Nov 12 '10 at 04:39
  • A binary search of list sorted by date will be faster than looping over every item, with one caveat--sorting the list costs more than iterating over the entire list once. So use this approach only if you need to do this search more than once. If just once, iterating over the entire list is as fast as you can get. (I would have written this as an answer, but it's just an expansion of dcrodjer's comment.) – Steven Rumbalski Nov 12 '10 at 05:01
  • @Steven Rumbalski : please provide a binary search implementation as an answer. – Tim McNamara Nov 12 '10 at 05:37
  • @Tim MacNamara: That what the `bisect` module is: a binary search implementation. – S.Lott Nov 12 '10 at 11:04
  • possible duplicate of [Searching a list of objects in Python](http://stackoverflow.com/questions/598398/searching-a-list-of-objects-in-python) – John Mee Sep 11 '12 at 13:51

5 Answers5

5

List comprehensions are a fairly efficient way to do this outside of a database:

[it for it in item_list if date_from_db > it.date]

Otherwise, you could use the filter builtin:

filter(lambda it: it if date_from_db > it.date, item_list)
Tim McNamara
  • 18,019
  • 4
  • 52
  • 83
  • List comprehensions and `filter` are just syntactic sugar for "looping over every item", which the OP felt was inefficient. So I don't think you've addressed the meat of the question. Almost a -1. – Steven Rumbalski Nov 12 '10 at 04:57
  • 2
    @Steven Rumbalski, there's no way to avoid doing that unless you have one of those magic computers that runs on unicorn dust or can make stronger assumptions on the list than have been provided. Even with a clever data structure, you would have to load it which would involve -wait for it- looping over every item. Granted, both answers could have pointed this fact out but both of them also hit on the best way to do this in pure python for the general case. – aaronasterling Nov 12 '10 at 05:02
  • 1
    @aaronasterling A binary search on a list sorted by date would be faster as long as the searching operation is needed multiple times. So there is a better algorithm, not just "unicorn dust". I didn't see it either until I read dcrodjer's comment above. – Steven Rumbalski Nov 12 '10 at 05:05
  • I think it's a safe bet that he's searching the list many times, otherwise performance wouldn't be an issue. Searching a list of 1,000 items takes microseconds on today's computers. – Gabe Nov 12 '10 at 05:08
  • @Steven Rumbalski, good points but now we're assuming that at least two queries will be run or that the list is sorted to begin with. @Gabe, that is a good bet. – aaronasterling Nov 12 '10 at 05:09
  • Well, they're almost syntatic sugar. In Python 2, list comprehensions always use local variables, meaning a significant performance increase than looping through a list in the global namespace. – Tim McNamara Nov 12 '10 at 05:13
  • @Tim I should have said same algorithmic complexity. List comps are faster than for loops, but if you are having a performance issue and that's the only optimization you can make, you'll be saving very little. That said, a list comp is clearer and shorter and the right way to go. – Steven Rumbalski Nov 12 '10 at 05:19
  • Thanks, for all of the comments and answer. I ended up using this, just because of it's simplicity. – imns Nov 12 '10 at 21:38
2

The only way to avoid looping over every item in your list is to sort it by date and then just searching it backwards until you find the last item that's greater than your target date, adding them to your filtered_list as you go.

Or sort your list descending and search forwards until you find the first last item that's greater than your target. This would let you easily modify your loop like this:

filtered_list = [] 
for it in item_list: 
    if date_from_db > it.date: 
        filtered_list.append(it) 
    else:
        break

Alternatively, if you expect more than a few items in your filtered list, it may be faster to do a binary search to find the first item that meets your criteria and use a list slice to copy it to filtered_list:

first = binary_search(item_list, lambda it: cmp(date_from_db, it.date))
if first == -1:
    return []
return item_list[first:]

And here's a binary search function I adapted from the link in the last paragraph. I believe it should work:

def binary_search(a, comp, lo=0, hi=None): 
    if hi is None: 
        hi = len(a) 
    while lo < hi: 
        mid = (lo+hi)//2 
        cmpval = comp(a[mid])
        if cmpval < 0:
            lo = mid+1 
        elif cmpval > 0:
            hi = mid 
        else: 
            return mid 
    return -1 
Community
  • 1
  • 1
Gabe
  • 84,912
  • 12
  • 139
  • 238
  • Steven Rumbalski: Doing a binary search will help him find the first item that matches his criteria, but that doesn't matter because he needs to find *all* of them. – Gabe Nov 12 '10 at 05:11
  • 1
    After he finds the first item, a list slice will be more efficient that manually appending to a list. The slight algorithmic loss should more than be made up for by the efficiency of a slice. – aaronasterling Nov 12 '10 at 05:14
  • On a sorted list, once you know where your first item is, you know where they all are (just cut the list at that point). So for very few comparisons, you get your entire result set. – Steven Rumbalski Nov 12 '10 at 05:17
  • aaronsterling: Either way you have to copy part of the list. I suppose which method would be faster depends on how many items he expects to find. If it's just a few, copying the list in Python will be faster than doing the search. If it's many, copying the list in C will be sufficiently faster that it more than makes up for the added searching. – Gabe Nov 12 '10 at 05:25
1

In response to a claim that this list comp is confusing, I'll post a way to format it that makes it clear. I've been using this a lot recently.

filtered_list = [item                         # What we're collecting
                 for item in item_list        # What we're collecting it over 
                 if date_from_db < item.date] # the conditions

It does turn what could be a one liner into a three liner like it would be with a regular for loop but in cases much worse than this (and even here) it improves readability and lets you have the improved efficiency.

aaronasterling
  • 68,820
  • 20
  • 127
  • 125
0

You can use filter:

filtered_list = filter(lambda(item): date_from_db < item.date, item_list)

Or you can use for comprehension:

filtered_list = [item for item in item_list if date_from_db < item.date]

I believe that people prefer the latter more often, but I like the former. Lambda is just an inline function - you can make that function explicit if you like.

Michael Neale
  • 19,248
  • 19
  • 77
  • 109
  • 1
    yes I prefer the filter. Same applies to the other one (who also pinched my filter !) – Michael Neale Nov 12 '10 at 04:49
  • No, my comment was saying that you had a mistake in your list comp but now you've fixed it. The list comp is pretty much objectively better except in very limited circumstances of which this is not one. Try using timeit. – aaronasterling Nov 12 '10 at 04:51
  • ah sorry - yes I fixed that straight away - surprised you saw it. Haha - yeah - in this case it looks confusing. I don't know why I don't like the comprehensions (I have the same aversion in scala to them - but I get away with it there as no lambda limitations ) ;) – Michael Neale Nov 12 '10 at 04:52
  • In a language with real lambdas, they're much better and you'd have to pry them out of my cold dead hands. Here in python, you should almost never use them. Get used to comprehensions :) They're great. – aaronasterling Nov 12 '10 at 04:54
  • hahah thanks - that confirmed my suspicions. I sometimes give in and use a throwaway function – Michael Neale Nov 12 '10 at 04:57
  • List comprehensions and `filter` are just syntactic sugar for "looping over every item", which the OP felt was inefficient. So I don't think you've addressed the meat of the question. Almost a -1. – Steven Rumbalski Nov 12 '10 at 04:57
  • @Micheal Neale - Just to let you know, I didn't actually steal `filter`, I was editing the question as you answered. Sorry if you felt ripped off. – Tim McNamara Nov 12 '10 at 23:21
  • ah was tongue in cheek. The filter is what I always reach for, but I should learn to love the comprehension. – Michael Neale Nov 14 '10 at 00:58
-1

Sort the list then bisect it. This works the same if you're looking for one specific item and will also solve the puzzle of finding an object based on it's attributes.

# add __eq__ and __lt__ functions to your object so they'll sort by date
class Item(object):
    """blah blah as above"""
    def __eq__(self, other):            
        return self.date == other.date

    def __lt__(self, other):
        """Order by date ascending"""
        return self.date < other.date

# sort that puppy
item_list = sorted(item_list)

# mock up an item to faux search for
item_from_db = Item(None, None, None, date_from_db)

# find the index of it's rightful place
idx = bisect.bisect(item_list, item_from_db)

# aberacadabera
return item_list[idx:]

Implementing an append routine for filtered_list which uses bisect.insort, rather than sorting the whole list in one hit, seems likely to offer performance gains as well. Not tested exactly as posted, but should be enough to get you there.

John Mee
  • 50,179
  • 34
  • 152
  • 186