4

I having a list of tuple as describes below (This tuple is sorted in decreasing order of the second value):

from string import ascii_letters
myTup = zip (ascii_letters, range(10)[::-1])
threshold = 5.5

>>> myTup
[('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), \
('i', 1), ('j', 0)]

Given a threshold, what is the best possible way to discard all tuples having the second value less than this threshold.

I am having more than 5 million tuples and thus don't want to perform comparison tuple by tuple basis and consequently delete or add to another list of tuples.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Curious
  • 3,507
  • 8
  • 28
  • 30
  • 2
    Since your list is already sorted: How about first doing a [binary search](http://en.wikipedia.org/wiki/Binary_search_algorithm) to find the index of the first tuple below the threshold. – Lukas Graf Sep 12 '12 at 15:42

5 Answers5

7

Since the tuples are sorted, you can simply search for the first tuple with a value lower than the threshold, and then delete the remaining values using slice notation:

index = next(i for i, (t1, t2) in enumerate(myTup) if t2 < threshold)
del myTup[index:]

As Vaughn Cato points out, a binary search would speed things up even more. bisect.bisect would be useful, except that it won't work with your current data structure unless you create a separate key sequence, as documented here. But that violates your prohibition on creating new lists.

Still, you could use the source code as the basis for your own binary search. Or, you could change your data structure:

>>> myTup
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), 
 (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')]
>>> index = bisect.bisect(myTup, (threshold, None))
>>> del myTup[:index]
>>> myTup
[(6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')]

The disadvantage here is that the deletion may occur in linear time, since Python will have to shift the entire block of memory back... unless Python is smart about deleting slices that start from 0. (Anyone know?)

Finally, if you're really willing to change your data structure, you could do this:

[(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd'), (-5, 'e'), (-4, 'f'), 
 (-3, 'g'), (-2, 'h'), (-1, 'i'), (0, 'j')]
>>> index = bisect.bisect(myTup, (-threshold, None))
>>> del myTup[index:]
>>> myTup
[(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd')]

(Note that Python 3 will complain about the None comparison, so you could use something like (-threshold, chr(0)) instead.)

My suspicion is that the linear time search I suggested at the beginning is acceptable in most circumstances.

senderle
  • 145,869
  • 36
  • 209
  • 233
  • 2
    Good point about the values being sorted. How about a binary search to speed that up? – Vaughn Cato Sep 12 '12 at 15:41
  • You can't using `bisect` like that, because you have to compare only the threshold and not the letters. A `key` argument for `bisect` would be great... – Bakuriu Sep 12 '12 at 15:47
  • 1
    Also, bisect only sorts in ascending order. According to the [docs](http://docs.python.org/library/bisect.html#other-examples), it looks like they recommend making a list (from mapping your key function over the original list) and doing a bisect on that list. – Sam Mussmann Sep 12 '12 at 15:49
  • 1
    This is surprisingly difficult to do correctly (I was halfway through making a reversed-view wrapper before I decided it was silly). The `bisect` module is definitely less convenient than it could be. – DSM Sep 12 '12 at 16:03
  • Yeah, I think that too. I've write on comp.lang.python, just to see what are the opinions about this fact. Really, I think that binary search it's quite easy to implement, so I don't see why they shouldn't provide such basic features. Also there are no drawbacks.. you just have to keep in mind that the "keys" would be reevaluated every time and decide what to do. – Bakuriu Sep 12 '12 at 16:06
2

Here's an exotic approach that wraps the list in a list-like object before performing bisect.

import bisect

def revkey(items):
    class Items:
        def __getitem__(self, index):
            assert 0 <= index < _len
            return items[_max-index][1]
        def __len__(self):
            return _len
        def bisect(self, value):
            return _len - bisect.bisect_left(self, value)
    _len = len(items)
    _max = _len-1
    return Items()

tuples = [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), ('i', 1), ('j', 0)]

for x in range(-2, 12):
    assert len(tuples) == 10
    t = tuples[:]
    stop = revkey(t).bisect(x)
    del t[stop:]
    assert t == [item for item in tuples if item[1] >= x]
  • +1: this is the sort of thing I was thinking of above. On reflection I'm actually a little surprised I've never needed a reversed view before. – DSM Sep 12 '12 at 20:52
1

Maybe a bit faster code than of @Curious:

newTup=[]
for tup in myTup:
    if tup[1]>threshold:
        newTup.append(tup)
    else:
        break

Because the tuples are ordered, you do not have to go through all of them.

Another possibility would also be, to use bisection, and find the index i of last element, which is above threshold. Then you would do:

newTup=myTup[:i]

I think the last method would be the fastest.

Nejc
  • 692
  • 6
  • 13
0

Given the number of tuples you're dealing with, you may want to consider using NumPy.

Define a structured array like

my_array= np.array(myTup, dtype=[('f0',"|S10"), ('f1',float)])

You can access the second elements of your tuples with myarray['f1'] which gives you a float array. Youcan know use fancy indexing techniques to filter the elements you want, like

my_array[myarray['f1'] < threshold]

keeping only the entries where your f1 is less than your threshold..

Pierre GM
  • 19,809
  • 3
  • 56
  • 67
0

You can also use itertools e.g.

from itertools import ifilter
iterable_filtered = ifilter(lambda x : x[1] > threshold, myTup)

If you wanted an iterable filtered list or just:

filtered = filter(lambda x: x[1] > threshold, myTup)

to go straight to a list.

I'm not too familiar with the relative performance of these methods and would have to test them (e.g. in IPython using %timeit).

Martin
  • 7,089
  • 3
  • 28
  • 43