20

Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element but I'm only interested in one element so there's probably a better way than sorting the entire array...

ooboo
  • 16,259
  • 13
  • 37
  • 32

11 Answers11

21

A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.

import heapq

def nth_largest(n, iter):
    return heapq.nlargest(n, iter)[-1]

Example Usage:

>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920

Confirm result by sorting:

>>> list(sorted(iter))[-10]
920
FogleBird
  • 74,300
  • 25
  • 125
  • 131
  • 4
    This works well (linear time) if you want the nth largest or smallest item(s), where n is a constant. If n is half the length of the list (i.e. you want the median), this is still O(nlogn) time. – mgold Oct 11 '12 at 17:34
  • This not an in place solution, Quickselect will not add O(n) extra memory as this solution would. So for very large arrays as the question asks, this probably wouldn't be the most efficient. – db1234 Jan 02 '17 at 23:57
18

Sorting would require O(nlogn) runtime at minimum - There are very efficient selection algorithms which can solve your problem in linear time.

Partition-based selection (sometimes Quick select), which is based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode + Another example).

Dario
  • 48,658
  • 8
  • 97
  • 130
  • Nice link. I believe this is the best. – Jeff Meatball Yang Jun 23 '09 at 20:26
  • 10
    Unfortunately, the link "Another example" now leads to a protected web page at MIT, that you must have permission to access. – Beel Feb 10 '13 at 05:48
  • [NumPy has this built-in](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.partition.html), although it's kind of a weird dependency to pull in if you're not already making use of its ndarray functionality. – user2357112 Aug 04 '16 at 01:42
4

A simple modified quicksort works very well in practice. It has average running time proportional to N (though worst case bad luck running time is O(N^2)).

Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison. In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to sort ONE of the bins.. the population of each bin tells you which bin holds your n-th highest value. So for example if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.

SPWorley
  • 11,550
  • 9
  • 43
  • 63
3

You can iterate the entire sequence maintaining a list of the 5 largest values you find (this will be O(n)). That being said I think it would just be simpler to sort the list.

Andrew Hare
  • 344,730
  • 71
  • 640
  • 635
3

You could try the Median of Medians method - it's speed is O(N).

user183037
  • 2,549
  • 4
  • 31
  • 42
1

Use heapsort. It only partially orders the list until you draw the elements out.

UncleO
  • 8,299
  • 21
  • 29
1

You essentially want to produce a "top-N" list and select the one at the end of that list.

So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.

After you finish scanning, pick the last item in your top-N list.

An example for ints and N = 5:

int[] top5 = new int[5]();
top5[0] = top5[1] = top5[2] = top5[3] = top5[4] = 0x80000000; // or your min value

for(int i = 0; i < largeArray.length; i++) {
    if(largeArray[i] > top5[4]) {
       // insert into top5:
       top5[4] = largeArray[i];

       // resort:
       quickSort(top5);
    }
}
Jeff Meatball Yang
  • 37,839
  • 27
  • 91
  • 125
1

As people have said, you can walk the list once keeping track of K largest values. If K is large this algorithm will be close to O(n2).

However, you can store your Kth largest values as a binary tree and the operation becomes O(n log k).

According to Wikipedia, this is the best selection algorithm:

 function findFirstK(list, left, right, k)
     if right > left
         select pivotIndex between left and right
         pivotNewIndex := partition(list, left, right, pivotIndex)
         if pivotNewIndex > k  // new condition
             findFirstK(list, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             findFirstK(list, pivotNewIndex+1, right, k)

Its complexity is O(n)

Unknown
  • 45,913
  • 27
  • 138
  • 182
  • I believe the Tournament Algorithm, see Dario's links, is what you are shooting for. It has an operation of O(n + k*log(n)). – tgray Jun 23 '09 at 20:43
  • 1
    My mistake, though I would be interested to see a full implementation of this in Python. – tgray Jun 24 '09 at 11:44
0

One thing you should do if this is in production code is test with samples of your data. For example, you might consider 1000 or 10000 elements 'large' arrays, and code up a quickselect method from a recipe.

The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a python written quickselect method on small to medium sized datasets (< 1,000,000 elements). Also, you might find as you increase the size of the array beyond that amount, memory is more efficiently handled in native code, and the benefit continues.

So, even if quickselect is O(n) vs sorted's O(nlogn), that doesn't take into account how many actual machine code instructions processing each n elements will take, any impacts on pipelining, uses of processor caches and other things the creators and maintainers of sorted will bake into the python code.

0

You can keep two different counts for each element -- the number of elements bigger than the element, and the number of elements lesser than the element.

Then do a if check N == number of elements bigger than each element -- the element satisfies this above condition is your output

check below solution

def NthHighest(l,n):
    if len(l) <n:
        return 0

    for i in range(len(l)):
        low_count = 0
        up_count = 0

        for j in range(len(l)):
            if l[j] > l[i]:
                up_count = up_count + 1
            else:
                low_count = low_count + 1

        # print(l[i],low_count, up_count)
        if up_count == n-1:
            #print(l[i])
            return l[i]

# # find the 4th largest number 

l = [1,3,4,9,5,15,5,13,19,27,22]
print(NthHighest(l,4))  

-- using the above solution you can find both - Nth highest as well as Nth Lowest

Sankar
  • 546
  • 4
  • 15
0

If you do not mind using pandas then:

import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)

The above code will show you the N largest values along with the index position of each value.

Hope it helps. :-)

Pandas Nlargest Documentation

Jack
  • 173
  • 3
  • 9