41

A friend of mine was asked the question

Retrieving the top 100 largest numbers from one hundred million numbers

in a recent job interview. Do you have any ideas for an efficient way to solve it?

phuclv
  • 37,963
  • 15
  • 156
  • 475
didxga
  • 5,935
  • 4
  • 43
  • 58
  • Funny. I was just asked a very similar question in an interview today (phrased very differently, but boiled down to a similar problem). Wish you would have posted this yesterday! :p – hemp Mar 31 '10 at 06:14
  • 3
    Related question: "Millions of 3D points: How to find the 10 of them closest to a given point?" http://stackoverflow.com/questions/2486093/millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point The same answers are applicable heap, partial_sort, even linear search. The problem might be constrained by IO and therefore any algorithm will do. – jfs Mar 31 '10 at 06:29
  • Does a kd-tree help with nearest neighbours in 1-D (as in this case)? – Lazer Mar 31 '10 at 09:02
  • 3
    A question like this almost REQUIRES you to ask for more information. If they answer that you can make your own assumptions for everything which is not clear, you can assume the data is provided to you in a sorted list, and just take the first 100 numbers :) – Fortega Mar 31 '10 at 15:55
  • @eSKay: Look at the code http://stackoverflow.com/questions/2486093/millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point/2486341#2486341 There is heap-, partial_sort-, linear search- based C++ programs that can be used almost directly (just use your reading and comparison functions instead of `getpoint()` and `less_distance`) – jfs Mar 31 '10 at 18:27
  • Wasn't there another SO question about getting the top 100 from a linked list? I can't find the sod. – Tom Hawtin - tackline Apr 01 '10 at 16:11

12 Answers

66

Run them all through a min-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.

A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

Edit: I failed the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

import heapq

def funnel(n, numbers):
    """Return the n largest items of `numbers`, in heap order (not sorted)."""
    if n == 0: return []
    heap = numbers[:n]                  # seed a min-heap with the first n items
    heapq.heapify(heap)                 # heap[0] is now the smallest of those n
    for k in numbers[n:]:
        if heap[0] < k:                 # k beats the current minimum...
            heapq.heapreplace(heap, k)  # ...so it evicts heap[0] and sifts in
    return heap

>>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
[5, 8, 6, 9]
Billz
  • 7,879
  • 7
  • 33
  • 35
Darius Bacon
  • 14,921
  • 5
  • 53
  • 53
11

Ok, here is a really stupid answer, but it is a valid one:

  • Load all 100 million entries into an array
  • Call some quick sort implementation on it
  • Take the last 100 items (it sorts ascending), or the first 100 if you can sort descending.
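A minimal sketch of those three steps in Python (assuming the numbers already fit in memory as a list; the function name is mine):

def top100_by_full_sort(numbers):
    numbers.sort()           # in-place ascending sort, O(n log n)
    return numbers[-100:]    # the last 100 items are the largest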

Reasoning:

  • There is no context on the question, so efficiency can be argued - what IS efficient? Computer time or programmer time?
  • This method is implementable very fast.
  • 100 million entries (numbers) are just a few hundred MB (~400 MB at 4 bytes apiece), so every decent workstation can simply handle that.

It is an OK solution for some sort of one-time operation. It would suck running it x times per second or something. But then, we need more context - as mcliedtk also noted with his simple SQL statement - because assuming the 100 million numbers do not already exist in memory is a feasible reading of the question: they may come from a database, and most of the time they will when we are talking about business-relevant numbers.

As such, the question is really hard to answer - efficiency first has to be defined.

TomTom
  • 61,059
  • 10
  • 88
  • 148
  • Yes, but as pointed out in my answer, (at least if you're using the right language) it's just as easy to do the job more efficiently. – Jerry Coffin Mar 31 '10 at 17:58
5

Mergesort in batches of 100, then only keep the top 100.

Incidentally, you can scale this in all sorts of directions, including concurrently.
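A rough Python sketch of that batch idea (assuming an iterable of numbers; the batch size and function names are mine):

def top100_by_batches(numbers, batch=100, k=100):
    top = []                              # running top-k, kept sorted ascending
    buf = []
    for x in numbers:
        buf.append(x)
        if len(buf) == batch:
            top = sorted(top + buf)[-k:]  # merge the batch into the running top-k
            buf = []
    return sorted(top + buf)[-k:]         # flush the final partial batch

Each merge touches at most batch + k numbers, and batches could just as well be sorted on separate machines and merged pairwise, which is the scaling mentioned above.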

MSN
  • 53,214
  • 7
  • 75
  • 105
  • What if the top hundred were in the same batch? – nikhil Jul 08 '12 at 13:16
  • @nikhil, nothing different would happen if they are all in the same batch. Granted, when I say "mergesort in batches of 100", I mean mergesort two collections of 100 numbers each. – MSN Jul 11 '12 at 19:51
5

If the data is already in an array that you can modify, you could use a variant of Hoare's Select algorithm, which is (in turn) a variant of Quicksort.

The basic idea is pretty simple. In Quicksort, you partition the array into two pieces, one of items larger than the pivot, and the other of items smaller than the pivot. Then you recursively sort each partition.

In the Select algorithm, you do the partitioning step exactly as before -- but instead of recursively sorting both partitions, you look at which partition contains the elements you want, and recursively select ONLY in that partition. E.g., assuming your 100 million items partition nearly in half, for the first several iterations you're going to look only at the upper partition.

Eventually, you're likely to reach a point where the portion you want "bridges" two partitions -- e.g., you have a partition of ~150 numbers, and when you partition that you end up with two pieces of ~75 apiece. At that point, only one minor detail changes: instead of rejecting one partition and continuing work on only the other, you accept the upper partition of 75 items, and then continue looking for the top 25 in the lower partition.

If you were doing this in C++, you could do this with std::nth_element (which will normally be implemented approximately as described above). On average, this has linear complexity, which I believe is about as good as you can hope for (absent some preexisting order, I don't see any way to find the top N elements without looking at all the elements).
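To illustrate, here is a minimal Python sketch of that select idea (the partition scheme and names are mine; in C++ you would just call std::nth_element):

import random

def select_top_k(a, k):
    """In-place quickselect variant: afterwards the k largest items of
    `a` occupy a[-k:] (in no particular order). Average O(n)."""
    lo, hi = 0, len(a) - 1
    target = len(a) - k                # final index of the k-th largest
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                  # Hoare-style partition
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1; j -= 1
        if target <= j:                # k-th largest lies in the lower piece
            hi = j
        elif target >= i:              # ...or in the upper piece
            lo = i
        else:                          # it sits between the pieces: done
            break
    return a[-k:]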

If the data's not already in an array, and you're (for example) reading the data from a file, you usually want to use a heap. You basically read an item, insert it into the heap, and if the heap is larger than your target (100 items, in this case) you remove one and re-heapify.

What's probably not so obvious (but is actually true) is that you don't normally want to use a max-heap for this task. At first glance, it seems pretty obvious: if you want to get the maximum items you should use a max heap.

It's simpler, however, to think in terms of the items you're "removing" from the heap. A max heap lets you find the one largest item in the heap quickly. It is not, however, optimized for finding the smallest item in the heap.

In this case, we're interested primarily in the smallest item in the heap. In particular, when we read each item in from the file, we want to compare it to the smallest item in the heap. If (and only if) it's larger than the smallest item in the heap, we want to replace that smallest item currently in the heap with the new item. Since that's (by definition) larger than the existing item, we'll then need to sift that into the correct position in the heap.

But note: if the items in the file are randomly ordered, as we read through the file we fairly quickly reach a point at which most items we read from the file will be smaller than the smallest item in our heap. Since we have easy access to the smallest item in the heap, that comparison is quick and easy, and the smaller items are never inserted into the heap at all.
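A sketch of that streaming approach in Python (the file name, one-number-per-line format, and names are assumptions):

import heapq

def top100_from_file(path, k=100):
    heap = []                              # min-heap of the top k seen so far
    with open(path) as f:
        for line in f:
            x = int(line)
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:              # beats the current minimum...
                heapq.heapreplace(heap, x) # ...so evict it and sift x in
            # otherwise: most items land here and never touch the heap
    return heap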

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • This takes O(n) time, but also O(n) space. – Darius Bacon Mar 31 '10 at 19:33
  • 1
    @Darius:Yes, that's quite true. Depending on the situation, that could be a huge problem, or none at all -- in particular, the question says you're given the numbers; if you're given them already in memory, then it makes about the most effective possible use of memory space -- virtually none extra at all. On the other hand, if you're given the data in a file (for example) allocating that much space could be a real problem (allocating virtual memory could take longer than your solution would to run, just for one example). – Jerry Coffin Mar 31 '10 at 19:49
4

By TOP 100, do you mean 100 largest? If so:

SELECT TOP 100 Number FROM RidiculouslyLargeTable ORDER BY Number DESC

Make sure you tell the interviewer that you assume the table is indexed properly.

mcliedtk
  • 1,031
  • 1
  • 12
  • 20
  • 8
    Who said anything about a DB? – Reed Copsey Mar 31 '10 at 05:27
  • 6
    Who said not in a db? With 100 million data points, ASSUMING they are not in a text file is sensible - after all, this is a typical OLTP / Data warehouse type of query. – TomTom Mar 31 '10 at 05:30
  • Sorry, but this is not a question about SQL. Assume that the interviewer needs an algorithm to solve this problem by writing program code or pseudo-code. – didxga Mar 31 '10 at 05:31
  • 1
    Yes, the question did not mention a database, though it did not mention any sort of data structure to begin with. In an actual interview, I would ask for clarification, but the question is vague enough to suggest that this answer is 'one' solution. I don't see how assuming the numbers are in an array, a file, a tree, and so forth is any different from assuming the presence of a database. – mcliedtk Mar 31 '10 at 05:52
  • 3
    If I asked that question in an interview and received that answer, I would probably say something like, "Great! Now supposing you are writing the RDBMS, how would you implement that query?" – hemp Mar 31 '10 at 06:17
  • 1
    Eric Lippert explained this situation very well in this blog entry (http://blogs.msdn.com/ericlippert/archive/2009/03/20/it-s-not-magic.aspx) where he said, "Where I start to see magical thinking is when I ask the candidate to assume that their chosen solution is simply not yet implemented on their platform." – hemp Mar 31 '10 at 06:26
  • 3
    @hemp, "Why in the world are you asking me about how to develop an RDBMS, when I'm applying as Perl web developer?" – P Shved Mar 31 '10 at 12:40
  • 2
    Because even a lowly Perl web developer should be a solid enough engineer that you can reason intelligibly about how you might implement that query. I'm not expecting that you've done it before, or that you will ever do it in the future. I don't hire based on experience; I hire based on demonstrable intelligence and ability to solve problems using basic computer science fundamentals. – hemp Apr 01 '10 at 16:59
1

There's no reason to sort the whole list. This should be doable in O(n) time. In pseudocode:

List top = new List

for each num in entireList
    if top.Length = 0 then
        top.Add(num)
    else
        for i = 0 to top.Length - 1
            if num > top[i] then
                top.InsertBefore(num, i)
                if top.Length > 100 then
                    top.Remove(top.Length - 1)
                end if
                exit for
            else
                if i = top.Length - 1 and top.Length < 100 then
                    top.Add(num)
                end if
            end if
        next
    end if
next
tloflin
  • 4,050
  • 1
  • 25
  • 33
  • For this to work, you would have to keep the "top" List sorted. – mbeckish Mar 31 '10 at 19:38
  • Yeah, fixed, I forgot about that part. Still O(n), I think. – tloflin Mar 31 '10 at 19:45
  • 2
    Your algorithm is only O(n) because you're asked to find the top 100, rather than the top `k`, where `k < n`. In the worst case scenario, the original list is sorted from smallest to largest, and so, for each number in the original list, you will always scan to the end of `top`. If instead you were asked to find the top `k`, your algorithm runs in O(kn), which may be poor for large `k` and `n`. – gotgenes Mar 31 '10 at 20:10
  • Also, `if i = top.Length - 1` should be `if i == top.Length - 1`. – gotgenes Mar 31 '10 at 20:12
  • 1
    "Your algorithm is only O(n) because you're asked to find the top 100." - That is correct. Of course if the problem were different it would require a different algorithm. But 100, compared to 100 million, is not an issue. Also, my code is pseudocode, so as long as you understood it, the syntax is irrelevant. – tloflin Mar 31 '10 at 20:42
0

Heapify the array in O(n). Then take out top 100 elements.
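A short Python sketch of that idea (heapq builds min-heaps, so negate the values to get max-heap behavior; names are mine):

import heapq

def top100_by_heapify(numbers, k=100):
    neg = [-x for x in numbers]    # O(n) copy, negated to simulate a max-heap
    heapq.heapify(neg)             # O(n) heap construction
    return [-heapq.heappop(neg) for _ in range(k)]   # k pops, O(k log n)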

N A
  • 831
  • 2
  • 8
  • 28
0

Store the first 100 numbers in a min-heap of size 100.

  • The heap's root is then the minimum of the top 100 so far; check each incoming number against it to decide whether it is a candidate for the top 100.

    -- If it is larger, replace the root and re-heapify, so you always have a heap of the current top 100.

    The complexity is O(n log 100), i.e. effectively O(n).

0

@Darius's answer can actually be improved, by "pruning", i.e. deferring the heap-replace operation as required.

Suppose we have a=1000 at the top of the heap.
It has children b and c.
We know that b,c > 1000.

      a=1000
  +-----|-----+
 b>a         c>a

We now read the next number x=1035.
Since x>a we should discard a.
Instead we store (x=1035, a=1000) at the root.
We do not (yet) bubble down the new value 1035.
Note that we still know b,c > a, but possibly b,c < x.
Now, we get the next number y:

when y < a < x, we can obviously discard it

when y > x > a, we replace x with y (the root now holds (y, a=1000))
=> we saved log(m) steps here, since x never has to bubble down

when x > y > a, we need to bubble down y recursively as required

Worst-case run time is still O(n log m),
but average run time, I think, might be O(n log log m) or something.
In any case, it is obviously a faster implementation.
YAZR
  • 79
  • 1
  • 8
  • -1 This doesn't work as written. Say the current 100 top numbers in the heap are even numbers between 1000 to 1198. So you added x=1035 and have (x=1035, a=1000) at root. Now, y=1037 comes along, so y>x>a and you say keep the root with (y=1037, a=1000), and dropped x. But x=1035 is still better than 1002 (still kept). (And it doesn't work the other way either if you drop a, or do a step like this if x>y>a). Also, you have no justification for claiming "might be O(n log log m)". Your scheme (if it could work) would only cut number of lg M steps in half; so would stay O(n lg m). – dr jimbob Mar 27 '13 at 17:50
-1

First iteration:

Quicksort, take top 100. O(n log n). Simple, easy to code. Very obvious.

Better? We are working with numbers, so do a radix sort (linear time) and take the top 100. I would expect this is what the interviewer is looking for.
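A rough sketch of an LSD radix sort in Python (assuming non-negative integers that fit in 32 bits; names are mine):

def radix_sort(nums):
    for shift in range(0, 32, 8):          # one stable pass per byte
        buckets = [[] for _ in range(256)]
        for x in nums:
            buckets[(x >> shift) & 0xFF].append(x)
        nums = [x for b in buckets for x in b]
    return nums

# usage: top100 = radix_sort(numbers)[-100:]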

Any other considerations? Well, 100 million numbers isn't a lot of memory, but if you want to minimize memory, you keep the max 100 numbers encountered so far and just scan the input. What would be the best way?

Some have mentioned a heap, but a bit better solution might be a doubly-linked list, where you keep a pointer to the minimum of the top 100 found so far. If you encounter a number a that is bigger than the current smallest in the list, compare it with the next element, and move the number from next to current until you find a place for the new number. (This is basically just a specialized heap for the situation.) With some tuning (if the number is greater than the current minimum, compare with the current maximum to see which direction to walk the list to find the insertion point) this would be relatively effective, and would only take about 1.5 KB of memory.

D'Nabre
  • 2,226
  • 16
  • 13
  • That is not what a heap is. Scanning a sorted doubly-linked list for insertion is O(n), whereas insertion with preserved sort order in a heap is O(log n) (=depth of the tree structure). Trying to optimize it with a binary search doesn't help, because the doubly linked list doesn't have random access, but forces you to traverse the entire list to get to the next halfway point. – Jens Roland Nov 28 '14 at 13:22
  • @JensRoland I didn't describe an insertion sort, a double-linked list with the extra pointer to the minimum element lets you keep a running 100 elements by swapping out the least element of the 100. This pushes the bigger elements towards the top. Which is pretty close to what a heapify will do. It's linear time, one pass over the list with constant operations for element. May not have described it very well. Keep in mind there lots of kinds of heaps. – D'Nabre Nov 28 '14 at 16:56
  • Ah - but the fact remains that keeping your least-element pointer updated is still a linear operation, so your algorithm is going to be O(n*m) for the general "Top m" case, whereas the minheap solution achieves O(n log m) – Jens Roland Nov 29 '14 at 10:27
  • @JensRoland I'm not describing the algorithm clearly/correctly. Can't fit pseudocode in a comment, and answer far too old to add it to. – D'Nabre Nov 29 '14 at 17:44
-1

Suppose mylist is a list of one hundred million numbers. We can sort the list and take the last hundred elements from mylist.

mylist.sort()

mylist[-100:]

Second way:

import heapq

heapq.nlargest(100, mylist)

Manish
  • 284
  • 1
  • 4
  • 10
-1
#include <string.h>

#define N 100000000                 /* one hundred million, not 100 billion */

int numbers[N] = {...};             /* the input (elided) */
int result[100] = {0};              /* top 100, kept sorted descending; the zero
                                       initializer assumes non-negative inputs */

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < 100; j++)
    {
        if (numbers[i] > result[j])
        {
            if (j < 99)
            {
                /* shift result[j..98] down one slot; the regions overlap,
                   so this must be memmove, not memcpy, and only 99-j
                   elements may move or the copy runs past the array */
                memmove(result + j + 1, result + j, (99 - j) * sizeof(int));
            }
            result[j] = numbers[i];
            break;
        }
    }
}
chris
  • 91
  • 1
  • 1
  • 6