4

Similar questions have been asked by others (for example, here), but they differed slightly and didn't really solve my problem, so here I go again.

I have N lists (N > 20,000), and each list contains M lists (M > 20,000), in the following manner (the data is dummy):

Key1: [ [4,3,1], [5,1,0] ...... [43,21,0 ] ]   # List 1 with collection of M smaller lists
:
:
KeyN: [ [5,4,1], [55,1,1] ...... [ 221, 0, 0] ] # Nth list

The data is unsorted. Iterating over a list of threshold values one by one, say Threshold = [2, 3, 5, 7, 8], where each threshold is applied to the middle element, I want to extract, for every key, all the sublists whose middle element is greater than the threshold. For example, going by the data above, Threshold = 2 would yield

 For Key1: [ [4,3,1], [43,21,0]]
 :
 : 
 For KeyN: [[5,4,1]]
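
For a single threshold, the straightforward scan I have in mind looks roughly like this (just a sketch, assuming for a moment that the data were a plain in-memory dict keyed by Key1 ... KeyN; the function name is made up):

def extract_above(data, threshold):
    # Keep, for every key, only the sublists whose middle element exceeds the threshold.
    return {key: [sub for sub in sublists if sub[1] > threshold]
            for key, sublists in data.items()}

# extract_above({'Key1': [[4, 3, 1], [5, 1, 0], [43, 21, 0]]}, 2)
# -> {'Key1': [[4, 3, 1], [43, 21, 0]]}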

And similarly for the other threshold values. Since there are so many lists, my observation is that sorting contributes a lot of overhead, and hence I want to avoid it. What is the optimal method of doing this in Python? One additional important point: I am constructing the data myself, so possibly there is a better data structure to store it in to begin with. I am currently storing the data as PersistentLists inside a BTree container in ZODB, which was suggested here. The following is a snippet of the code used for it:

for Gnodes in G.nodes():       # Gnodes iterates over the N keys
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():   # Hnodes iterates over the M entries per key
        Hvalue = someoperation(Hnodes, Gnodes)
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, -1])
    transaction.savepoint(True)
transaction.commit()

Any suggestions on what would be the most efficient way of doing this? Is sorting first indeed the optimal way?

R.Bahl
  • What is the source of the data? File? Are they uniform integers or ints and floats? – the wolf Jul 01 '12 at 01:52
  • As pointed out in the original post, I am generating the data using the above mentioned code. As soon as list for one key is done, it is temporarily written to the disk using the `savepoint()` method. Elements in the inner smaller list are of the following format, `[int, float, zero or 1]`. The range of the `int` variable is from 0 to M and has all the integers in between appearing in ascending order already i.e. from 0 to M because of the way data is being written. – R.Bahl Jul 01 '12 at 01:58

2 Answers

4

Use a generator comprehension:

(sublist for sublist in Key1 if sublist[1] > Threshold)

A generator only computes elements on demand, and since it goes through the elements of the list in order, there's no need to sort. (That is, it runs in time linear in the length of each Keyn, rather than the M*log(M) required for sorting.)

Equivalently, in functional style (only equivalent in Python 3; for Python 2, use itertools.ifilter):

filter(lambda sublist: sublist[1] > Threshold, Key1)

If your Keyn lists are stored in a list (or other subscriptable object), you can process them all at once (some alternative styles shown):

filtered_Keys = [(sublist for sublist in Key if sublist[1] > Threshold)
    for Key in Keys
]

or

filtered_Keys = list(map(
    lambda Key: filter(lambda sublist: sublist[1] > Threshold, Key),
    Keys
))

Performance of this method relative to sorting

Whether this method is faster than sorting depends on M and the number of thresholds T you have. The running time of repeated filtering (for each Key list) is O(M * T). If you sort the list once (O(M * log(M))), then you can use binary search for each threshold, giving an overall running time of O(M * log(M) + T * log(M)) = O(max(M, T) * log(M)). Sorting wins once T is sufficiently large (roughly once T exceeds log(M)). We can't know the constants a priori, so test both ways to see which is faster on your data.
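
For illustration, here is a minimal sketch of the sort-once-then-bisect approach (the function name is made up; it assumes each sublist looks like [int, float, flag] and you want everything whose middle element is strictly greater than each threshold):

import bisect

def filter_by_thresholds(sublists, thresholds):
    # Sort the M sublists by their middle element once: O(M log M).
    ordered = sorted(sublists, key=lambda s: s[1])
    middles = [s[1] for s in ordered]          # parallel list of keys for bisect
    results = {}
    for t in thresholds:
        # Everything strictly greater than t lies to the right of this index: O(log M).
        idx = bisect.bisect_right(middles, t)
        results[t] = ordered[idx:]
    return results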

If neither is fast enough, consider writing your own linear-time sort. For example, radix sort can be generalized to work on (non-negative) floats. If you're really concerned about performance here, you might have to write this as a C or Cython extension.
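
A rough sketch of that idea, assuming the scores are non-negative floats (an illustration only, not a tuned implementation):

import struct

def _float_key(x):
    # Non-negative IEEE-754 doubles keep their ordering when the raw
    # bit pattern is read as an unsigned 64-bit integer.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def radix_sort_by_middle(sublists):
    # LSD radix sort by the middle (float) element: 8 stable passes,
    # one per byte, giving O(M) behaviour instead of O(M log M).
    keyed = [(_float_key(s[1]), s) for s in sublists]
    for shift in range(0, 64, 8):
        buckets = [[] for _ in range(256)]
        for pair in keyed:
            buckets[(pair[0] >> shift) & 0xFF].append(pair)
        keyed = [pair for bucket in buckets for pair in bucket]
    return [s for _, s in keyed]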

Mechanical snail
  • I thought of it, but shouldn't it be less efficient than first sorting ? I am not sure though, and if you can comment, it will be great ! – R.Bahl Jul 01 '12 at 02:01
  • Also, note that, I need to do this for many different threshold values, therefore going by this method will imply that I would have to traverse through the entire list repeatedly for every threshold value. Shouldn't this impact the performance ? – R.Bahl Jul 01 '12 at 02:04
  • How is `filter()` "python 3 only"? Are you referring to the fact that `filter()` returns a generator in python 3? If so, I would also include the 2.x version: `itertools.ifilter`. – Joel Cornett Jul 01 '12 at 02:31
  • Thanks for the performance tips. To give an estimate of order, say I have 10,000 Keys, each with 10000 smaller lists, then the number of threshold values I need will be around 400 in increasing order say T=[ .1 .2 .3 ......] – R.Bahl Jul 01 '12 at 02:33
  • @R.Bahl: That's only 40 million entries, which should be fast enough either way. – Mechanical snail Jul 01 '12 at 02:35
  • @JoelCornett: That's what I meant. Thanks. – Mechanical snail Jul 01 '12 at 02:36
  • Problem is this will go up pretty quickly, so I am concerned about the performance. – R.Bahl Jul 01 '12 at 02:40
2

In numpy you can do this easily with an NxMx3 array:

import numpy as np

data = np.array([
    [ [4, 3, 1],  [5, 1, 0],  [43, 21, 0]  ],
    [ [5, 4, 1], [55, 1, 1], [221,  0,  0] ],
])
data[data[:, :, 1] > 2]

This returns:

array([[ 4,  3,  1],
       [43, 21,  0],
       [ 5,  4,  1]])

If you need the locations of the elements that crossed the threshold, use argwhere().
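
For example, continuing with the data array defined above (np.argwhere is the real NumPy call; the rest is just illustrative):

import numpy as np

# Index pairs (key index, sublist index) whose middle element exceeds 2.
locations = np.argwhere(data[:, :, 1] > 2)
# array([[0, 0],
#        [0, 2],
#        [1, 0]])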

Edit:

It's also possible to do multiple threshold comparisons simultaneously:

>>> mask = data[:, :, 1, np.newaxis] > np.array([[[2, 3, 4]]])
>>> data[mask[..., 0]]
array([[ 4,  3,  1],
       [43, 21,  0],
       [ 5,  4,  1]])

>>> data[mask[..., 1]]
array([[43, 21,  0],
       [ 5,  4,  1]])

>>> data[mask[..., 2]]
array([[43, 21,  0]])
Luke
  • Interesting. I didn't think about it. Do you have an estimate of the performance of this as compared to sorting or the solution mentioned by Mechanical snail? – R.Bahl Jul 01 '12 at 02:37
  • I think NumPy requires all elements of an `array` to be of the same type. That doesn't invalidate this method (since `float`s can represent all small integers exactly), but remember to cast the results back to `int` when necessary. – Mechanical snail Jul 01 '12 at 02:40
  • Or just use `dtype=numpy.int64` when constructing the array. – Chinmay Kanchi Jul 01 '12 at 02:44
  • I don't know how the performance will be since I'm not sure how numpy implements this in the background. I'd be interested to see some benchmarks, though. Sorting certainly could help, but as others have mentioned, this depends on the details of your data and thresholds. – Luke Jul 01 '12 at 02:48
  • Given NumPy is generally very fast on arrays, it's definitely going to be interesting. Let me go ahead and experiment with it. Will let you know how it goes. Thanks for all the suggestions! – R.Bahl Jul 01 '12 at 03:10